(57) A network protocol and interface using direct deposit messaging provides low overhead communication
in a network of multi-user computers. This system uses both sender-provided and receiver-provided
information to process received messages and to deposit both data and control information directly where
they are needed: data in memory and control information in conditionally/optionally interrupting a host
processor. Message processing is separated into data delivery, which bypasses the host processor and
operating system, and message actions which may or may not require host processor interaction. In this
protocol a message includes an indication of the operation desired by the sender, an operand specified by the
sender and an operand which refers to some information stored at the receiver. The receiver ensures that the
desired action is permitted and then, if the action is permitted, performs the action according to both the
operand specified by the sender and the state of the receiver. The action may be message delivery, wherein
the operands in the message specify values for use in various addressing modes including direct, indirect,
post-increment and index modes. The action may also be conditionally generating an interrupt, wherein the
operands are used, in combination with the receiver state, to determine whether a message requires
immediate or delayed action. The action may also be an operation on a register in the network interface or on
other information stored at the receiver. The network interface and protocol are intended for use with
local-area networks. Specializations of this interface and protocol are particularly applicable to asynchronous
transfer mode (ATM) networks. The network interface includes endpoints which may be nested and
overlapped, address registers which may be organized into windows which may be nested and overlapped,
address register protection and integration of exception handling and flow control.
COMPUTER NETWORK INTERFACE AND INTERFACE PROTOCOL
This invention relates to computer network interfaces and
protocols and more particularly to such interfaces and
protocols for low overhead communication. The invention is
particularly applicable to asynchronous transfer mode (ATM)
networks and local-area networks (LANs).



A communication system is a significant part of any
modern computer system. A fundamental characteristic of any
such communication system is the communication overhead.
Such overhead determines the kinds of applications that can
be exploited efficiently. Low overhead communication (low
latency and low impact on a host as defined below) is
particularly important in parallel, distributed, or real-time
computing systems.

In general, two fundamental properties in communication
systems contribute to overhead. The first is data delivery
and the second is message action. Specifically, data
delivery requires addressing at the receiver and message
action requires interrupts to invoke message action at the
receiver. Thus, communication involves both data delivery,
transferring data from a sender to a receiver, and message
action, invoking some special action, such as
synchronization, on arrival of data at the receiver. Message
action is often requested by interrupting the processor at the
receiver.

Communication systems can be classified according to the
division of burden between sender and receiver. In a
receiver-based system, information controlling data delivery
and message action is localized to the receiver. In
contrast, in a sender-based system, the sender plays a more
direct role by specifying information in each message to
control data delivery and message action. The key
information for data delivery is the destination address of
the data. In systems using receiver-based addressing, the
source has no direct input on the final address of a message.
A message identifies a buffer at the receiver into which the
message is stored at some implicit location, e.g., by
sequencing a pointer. By contrast, in systems using
sender-based addressing, the source specifies an address,
contained in each message, indicating directly where the
message should be stored at the receiver. Receiver-based
addressing involves significant overhead in comparison to
sender-based addressing. However, sender-based addressing
raises protection issues.
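
The two addressing styles described above can be contrasted in a few lines of Python. This is only an illustrative sketch: the class names `ReceiverBasedBuffer` and `SenderBasedMemory` and their methods are hypothetical and are not taken from any of the systems discussed. In the receiver-based case the storage position comes from a pointer sequenced by the receiver; in the sender-based case it comes from an address carried in the message.

```python
class ReceiverBasedBuffer:
    """Receiver-based addressing: the receiver picks the storage position."""

    def __init__(self, size):
        self.memory = [None] * size
        self.next_free = 0  # pointer sequenced by the receiver

    def deliver(self, payload):
        # The message names only the buffer; the position is implicit.
        position = self.next_free
        self.memory[position] = payload
        self.next_free += 1
        return position


class SenderBasedMemory:
    """Sender-based addressing: each message carries its own destination."""

    def __init__(self, size):
        self.memory = [None] * size

    def deliver(self, address, payload):
        # The sender-supplied address is used as-is. No protection check is
        # made here, which is the issue raised in the text.
        self.memory[address] = payload
        return address
```

In the first class two senders can share a buffer without coordinating, at the cost of receiver-side bookkeeping; in the second the sender needs no help from the receiver but can write anywhere unless a check is added.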

The key information for message action is whether to
generate an interrupt on message arrival. In systems using
receiver-based interrupts, an interrupt is generated by the
receiver on the arrival of every message. In systems using
sender-based interrupts, the sender specifies, by information
contained in each message, whether or not an interrupt should
be generated on the arrival of that message.

Most conventional communication systems are
receiver-based, such as the typical Ethernet network using
the Internet protocol. Most open, public local area networks
use this kind of protocol as well.

Several systems use purely sender-based addressing. For
example, there is a system called "Hamlyn," of which an
implementation in hardware is discussed in "Hamlyn: An
Interface for Sender-Based Communication," by John Wilkes,
Technical Report HPL-OSR-92-13, Hewlett-Packard. A
difficulty with the published work is that it is an
overview, with few implementation specifics. This system
includes sufficient protection mechanisms for a multi-user
LAN environment, but the published work does not specialize
it to any network, only an unspecified "private multicomputer
interconnect," similar to others for parallel machines.

Another system similar to Hamlyn is described in
"Efficient Support for Multicomputing on ATM Networks" by C.
Thekkath et al., Technical Report TR93-04-03, Dept. of
Computer Science and Engineering, Univ. of Washington,
Seattle, Washington, April 12, 1993. This system is a
software-based emulation of purely sender-based addressing
specialized to ATM LANs, designed for distributed system
applications. Hardware support for sender-based addressing
is not addressed by Thekkath et al.

One problem with systems which use purely sender-based
addressing is that in many cases the location at the receiver
where data is to be placed should be dependent on the state
of the receiver. For example, if incoming messages are to be
queued at the receiver, the location of the end of the queue
is dependent on the receiver state. To handle queueing in a
purely sender-based addressing system, the sender must know
the state of the receiver. Therefore, the sender must either
keep track of this location, which becomes difficult when
there is more than one sender sending to the same receiver,
or use additional messages to determine the state of the
receiver, which introduces atomicity issues. Either of these
options increases both latency and impact on the host
processors of the sender and receiver. Also, such extensive
knowledge of the receiver by the sender raises protection
problems. Similarly, sender-based interrupts also have
problems. The sender-based nature limits the possible
interrupt operations. For example, if the interrupt status
is only a function of the state maintained by the sender, it
becomes difficult to priority schedule interrupts at the
receiver.

To overcome some of the communication overhead problems
with either purely sender-based or purely receiver-based
communication, many new parallel machines use variations of
both sender-based and receiver-based addressing. For
example, the Meiko CS-2, of Waltham, Massachusetts, supports
both traditional send/receive communication using
receiver-based addressing and a remote read/write model using
sender-based addressing. To support bulk data transfer, the
CS-2, like many machines, has a co-processor for
demultiplexing and DMA to memory. This DMA usually has
scatter/gather capability, though typically only with
constant stride, and thus falls short of sender-based
addressing. The sender-based and receiver-based addressing
modes are combined mutually exclusively. This mutually
exclusive combination is also true of other new parallel
machines which use sender-based and receiver-based
addressing. In the MIT Alewife machine and the Stanford
FLASH machine, cache blocks for shared memory traffic use
sender-based addressing, while bulk data transfers and
send-receive traffic use receiver-based addressing.
Generally, sender-based addressing is used for random access
communication and receiver-based addressing is used for
protected, cross-domain communication.

Some similar work which combines both sender-based and
receiver-based addressing is a system which is described in
"Active Messages: A Mechanism for Integrated Communication
and Computation" in Int'l Symposium on Computer Architecture,
pp. 256-266, May 1992, by T. von Eicken et al. In that
system, the sender attaches to each message an address in the
receiver of an interrupt handler which is invoked upon
message delivery to extract the message from the network and
deposit or process the message as desired. The message may
also include other arguments. The interrupt handlers are
constrained in length and action and are executed using the
host processor in the receiver address space so that they run
without the cost of a context switch to a new thread. A
major problem with this system is that it is restricted, at
least without hardware support, to single user applications,
because context switching is otherwise required. An
additional problem with this system is that it also treats
all messages at the receiver as requiring an interrupt and
thus does not reduce the impact on the processor.

By way of further background, William Dally presents, in
Dally, W.J. et al., "Architecture of a Message-Driven
Processor", in International Symposium on Computer
Architecture; Dally, W.J. et al., "The Message Driven
Processor: An Integrated Multicomputer Processing Element",
in International Conference on Computer Design: VLSI in
Computers and Processors, Cambridge, Massachusetts,
October 11-14, 1992; and Dally, W.J. et al., "Fine-Grain
Concurrent Computing", in Research Directions in Computer
Science: An MIT Perspective, edited by Albert Meyer,
MIT Press, 1991, a message driven processor system, in which
a non-conventional host processor executes communication
actions using built-in communication primitives without
benefit of a separate network interface. The primary
disadvantage of this system is this lack of a separate
network interface which would otherwise permit the use of
conventional processors. Protection issues, especially for
user-to-user level communication, for a multi-user system are
also not addressed. Additionally, this system is not
well-adapted to ATM networks because of its message format.

In a multi-user network environment, typified by a
local-area network (LAN), communication is a global resource
for which protection must be provided to isolate a user on
one processor from accidental or malicious interference from
another user on another processor. Also, if the nodes are
multi-user as well, protection must also be provided to
similarly isolate users from each other on the same
processor. There has been very little work on achieving low
overhead in this type of multi-user network and multi-user
processor environment.



To overcome these problems and limitations with the prior
art, this invention provides a network protocol and interface
for low overhead communication in a network of multiuser
computers based on direct deposit messaging. Direct deposit
messaging signifies directly demultiplexing messages and
depositing both data and control information directly where
they are needed, e.g., data in memory, and control in
(conditionally) interrupting the host processor. In such a
system, asynchronous events are controlled by separating
events by their need for the host processor. Events, such as
data delivery, which do not require the host processor are
handled directly, e.g., by depositing data directly into user
memory. Events, chiefly synchronization, that require
interaction with the host processor are divided into
immediate actions that require immediate service and
delayable actions that are accumulated and processed at some
time convenient for the host processor, thereby turning them
into synchronous events. With this separation of data and
control events, resources are committed only as needed.
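
The split between immediate and delayable actions can be illustrated with a small Python sketch. The class `HostNotifier` and its method names are hypothetical, and appending to a list merely stands in for raising a hardware interrupt; the point is only that delayable actions accumulate until the host chooses to collect them.

```python
from collections import deque


class HostNotifier:
    """Toy model of separating immediate from delayable message actions."""

    def __init__(self):
        self.interrupts = []    # actions that interrupted the host at once
        self.pending = deque()  # delayable actions awaiting the host

    def on_message_action(self, action, immediate):
        if immediate:
            # Stand-in for raising an interrupt to the host processor.
            self.interrupts.append(action)
        else:
            # Accumulate; the host is not disturbed.
            self.pending.append(action)

    def poll(self):
        """Called by the host at a time of its own choosing."""
        drained = list(self.pending)
        self.pending.clear()
        return drained
```

Because `poll` is invoked by the host rather than by message arrival, the delayable actions reach the host as synchronous events.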

To obtain effective separation of these events, the
receiver processes a message using information in the message
from the sender. The receiver's action in response to a
message depends both on this sender information and
information stored at the receiver. A message therefore
includes an indication of the operation desired by the
sender, one or more operands specified by the sender and one
or more operands which refer to some state maintained by the
receiver. The receiver ensures that the desired action is
permitted and then, if the action is permitted, performs the
action according to the operand specified by the sender and
the state maintained by the receiver.
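
The check-then-act sequence just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the interface of the invention: the `Message` fields, the `Receiver` class, the single `"write"` operation, and the use of a set of (operation, register) pairs as the permission check are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    operation: str          # operation requested by the sender
    sender_operand: int     # value supplied by the sender
    receiver_operand: int   # index naming state kept at the receiver
    data: bytes = b""


class Receiver:
    def __init__(self, memory_size, registers, permitted):
        self.memory = bytearray(memory_size)
        self.registers = list(registers)  # receiver-maintained state
        self.permitted = set(permitted)   # allowed (operation, register) pairs

    def process(self, msg):
        # Step 1: ensure the requested action is permitted.
        if (msg.operation, msg.receiver_operand) not in self.permitted:
            return False
        # Step 2: act on both the sender operand and the receiver state.
        base = self.registers[msg.receiver_operand]
        if msg.operation == "write":
            start = base + msg.sender_operand
            if start < 0 or start + len(msg.data) > len(self.memory):
                return False  # reject writes outside receiver memory
            self.memory[start:start + len(msg.data)] = msg.data
            return True
        return False
```

For example, with register 0 holding 4, a permitted `"write"` message with sender operand 2 deposits its data at offset 6 of receiver memory without any host involvement in this model.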

The action performed by the receiver may be message
delivery, wherein an operand in the message specifies values
for use in various addressing modes, such as direct,
indirect, post-increment and index modes. Data is either
written or read from such addresses directly, without host
processor or operating system intervention while maintaining
multiuser protection. The action may also be conditionally
generating an interrupt, wherein an operand is used in
combination with the receiver state to determine whether a
message requires immediate or delayed action. The action may
involve special memory locations, called address registers,
located in the network interface. This network interface
is especially useful for asynchronous transfer mode (ATM)
networks in which messages are comprised of fixed size
primitive data units called cells.
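
The four addressing modes named above can be illustrated with a short Python sketch. The text here names the modes without defining them, so the semantics below are assumptions borrowed from conventional processor addressing modes, and the function `resolve_address` and its parameters are hypothetical.

```python
def resolve_address(mode, registers, reg_index=0, operand=0, length=1):
    """Return a destination address for one message (assumed semantics).

    registers : list of address registers kept at the receiver
    reg_index : which address register the message refers to
    operand   : value supplied by the sender
    length    : amount of data deposited, used by post-increment
    """
    if mode == "direct":
        # The sender operand is the address itself.
        return operand
    if mode == "indirect":
        # The address is read from a receiver-side register.
        return registers[reg_index]
    if mode == "post-increment":
        # Use the register, then advance it past the deposited data.
        address = registers[reg_index]
        registers[reg_index] += length
        return address
    if mode == "index":
        # Receiver-side base plus sender-supplied offset.
        return registers[reg_index] + operand
    raise ValueError(f"unknown addressing mode: {mode}")
```

Under these assumptions, post-increment mode lets several senders append to one queue without knowing where its end is, since the end pointer lives in a receiver-side register.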

The interface design also supports a multi-cell format in
which cells subsequent to the first contain purely data.
This enhancement provides increased bandwidth.

It is also possible to have endpoints which overlap to
allow controlled sharing between endpoints in a flexible
manner. Access to address registers is limited to a
contiguous block of registers called a window. As with
endpoints, address register windows may be overlapped and
nested with other address register windows to allow
controlled sharing between address register windows in a
flexible manner. Address register protection may also be
provided to restrict access of the sender to different
address registers.

Numerous other enhancements may be made, including
integrating exception handling and flow control, paging
connection and endpoint tables, adding global address
registers, and using hybrid mapping to reduce translation
look-aside buffer misses.
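
The idea of address register windows that may overlap or nest can be sketched as below. This is an assumption-laden illustration: the `Window` class, its `base` and `length` fields, and the helper `shared_registers` are hypothetical names, and a window is modelled simply as a contiguous range of register numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    base: int    # first absolute register number in the window
    length: int  # number of contiguous registers

    def translate(self, offset):
        """Map a window-relative register number to an absolute one."""
        if not 0 <= offset < self.length:
            raise IndexError("register outside window")
        return self.base + offset


def shared_registers(a, b):
    """Absolute registers reachable through both windows."""
    low = max(a.base, b.base)
    high = min(a.base + a.length, b.base + b.length)
    return list(range(low, high))
```

A window nested inside another shares all of its registers with the outer one, two overlapping windows share only the intersection, and disjoint windows share nothing, so the amount of sharing is set purely by how the windows are laid out.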

It is also possible to provide higher level operations
executable at the receiver as part of an instruction memory
accessible by an instruction pointer found in the operation
field of a message. Thus, the sender may be isolated from
some knowledge of the receiver. The full spectrum of
operation of sender-based addressing and receiver-based
addressing is thus provided.

With this system, communication overhead is reduced by
allowing the sender to specify as much as possible about the
intended action for the message, while still allowing the
receiver to control message reception for protection and
receiver-dependent operations. Thus, both sender and
receiver information is used to demultiplex messages directly
to where they are needed, reducing latency. The processor at
the receiver is involved only when synchronization is
required. That is, per-message interrupts are eliminated;
an interrupt is generated only when a message
requires immediate action. Thus, impact on the processor is
reduced. This combination of control of asynchronous events
and direct deposit messaging provides flexibility and reduced
overhead with both full protection and separation of control
and data. This network interface and protocol is applicable
across a wide range of networks and across a wide range of
applications in parallel, distributed, and real-time
computing.

In summary, a network protocol and interface using direct
deposit messaging provides low overhead communication in a
network of multi-user computers. This system uses both
sender-provided and receiver-provided information to process
received messages and to deposit both data and control
information directly where they are needed, data in memory
and control information in conditionally/optionally
interrupting a host processor. Message processing is
separated into data delivery, which bypasses the host
processor and operating system, and message actions which may
or may not require host processor interaction. In this
protocol, a message includes an indication of the operation
desired by the sender, an operand specified by the sender and
an operand which refers to some information stored at the
receiver. The receiver ensures that the desired action is
permitted and then, if the action is permitted, performs the
action according to both the operand specified by the sender
and the state of the receiver. The action may be message
delivery, wherein the operands in the message specify values
for use in various addressing modes including direct,
indirect, post-increment and index modes. The action may
also be conditionally generating an interrupt, wherein the
operands are used, in combination with the receiver state, to
determine whether a message requires immediate or delayed
action. The action may also be an operation on a register in
the network interface or on other information stored at the
receiver. The network interface and protocol are intended
for use with local-area networks. Specializations of this
interface and protocol are particularly applicable to
asynchronous transfer mode (ATM) networks. The network
interface includes endpoints which may be nested and
overlapped, address registers which may be organized into
windows which may be nested and overlapped, address register
protection and integration of exception handling and flow
control.

The invention will be further described by way of non-limitative
example with reference to the accompanying drawings, in which:-

Fig. 1 is a block diagram of a typical conventional
computer system with a communication system using
receiver-based addressing;

Fig. 2 is a flow chart describing a conventional
communication process for the computer system of Fig. 1;

Fig. 3 is a block diagram of a conventional computer
system with a communication system using sender-based
addressing;

Fig. 4 is a flow chart describing the operation of the
computer system of Fig. 3;

Fig. 5 is a block diagram of a computer system with a
communication system in accordance with the invention using
direct deposit messaging;

Fig. 6 is a flow chart describing the operation of the
computer system of Fig. 5;

Fig. 7 is a block diagram describing a suitable format
for an ATM cell to be used in one embodiment of this
invention;

Fig. 8 is a block diagram describing one embodiment of
the network interface in the communications system of the
computer system of Fig. 5 for use in ATM networks with the
cell format of Fig. 7;

Fig. 9 is a block diagram of the receive side operation
logic of Fig. 8 showing protection and address generation
portions;

Fig. 10 is a block diagram of an example of
the receive controller shown in Fig. 8;

Fig. 11 is a block diagram of a direct cache interface;

Fig. 12 is a flowchart describing how an endpoint and
connection based communication is established; and

Fig. 13 is a block diagram of the receive side operation
logic of Fig. 8 in a second embodiment of this invention.



The present invention will be more completely understood
through the following detailed description which should be
read in conjunction with the attached drawing in which
similar reference numbers indicate similar structures. All
references, including publications and patent applications,
cited above and following are hereby expressly incorporated
by reference.

A typical computer system is shown in Fig. 1. It
includes, for purposes of illustration, a first computer
(hereinafter called the sender) 50 and a second computer
(hereinafter called the receiver) 52. It should be
understood that the references to sender and receiver are used
merely for ease of illustration. Although two computers are
shown (i.e., sender 50 and receiver 52), the system is
not limited to two computers. There may be multiple senders
and one receiver, multiple receivers and one sender or
multiple senders and receivers. Also, each such computer may
comprise interconnected processors. Finally, the sender and
receiver may be interconnected processors within a single
computer such as one parallel computer. The term node as
used herein signifies any sender or receiver. Each node is
assumed to have its own virtual address space (i.e., the
nodes are assumed to have virtual memory) distinct and
different from that in other nodes. The network is
assumed to be used by multiple non-cooperating users and each
node may be used by more than one user. Thus, protection
mechanisms are typically provided to protect against
accidental or malicious interference with a process of one
user by another user. For example, there are mechanisms to
prevent one user from 1) unauthorized sending of messages to
another user, 2) unauthorized access to memory of another
user, or 3) trying to appear as another user to a receiver.
These mechanisms are commonly found in standard extensible
networks interconnecting multiple users and are often not
found in closed, proprietary networks.

The sender 50 includes a processor 54 connected to the
network interface 84 and programmed according to a desired
operating system, illustrated at 56. The operating system
is a computer program which manages node resources such
as processor time and network access, arbitrates among and
protects applications from each other, and controls
interaction between applications, such as application 58 in
Fig. 1, and the processor 54. The operating system
has associated therewith message buffers (e.g., 62) which are
used to send messages across the network interface to the
receiver computer 52. The applications have
associated therewith respective memory portions (e.g., 66),
which may be called endpoint buffers or simply endpoints.
Similarly, receiver computer 52 includes a processor 70
connected to the network interface 86. The processor is
programmed according to a desired operating system 72 as
illustrated in Fig. 1. Similar to the sender, the operating
system 72 of receiver 52 includes message buffers (e.g., 74),
and the applications 78 and 80 also have associated therewith
memory portions 79 and 81, which also may be called endpoint
buffers or simply endpoints.

In general the sender or receiver may include other
processors, such as a network co-processor or other
co-processors. To remove ambiguity, processors 54 and 70 are
referred to as "host processors" herein.

A message (shown at 83) in a conventional system includes
a header which is used by the network and the network
interfaces 84, 86 to direct the message to the appropriate
node and endpoint.

Applications, such as 58, at the sender 50 and receiver
52 communicate by sending messages to each other across the
network 82. The message may include any data, including a
procedure call. As shown in Fig. 2, communication
conventionally involves at least the following steps. First,
the application 58 invokes a send command. A
simple way for an application 58 to invoke a send operation
is by executing a command as depicted by the "send" command.
The command includes an identifier which is used in
part to identify the receiver, a source address from which
message data will be taken and a size, indicating the amount
of data to be sent. The operating system then copies the
message in step 100 from application memory, such as an
endpoint 66, to message buffers, e.g., 62, in the operating
system 56 of sender 50. This step is often optimized by
mapping locations in the application memory to locations in
the message buffer 62 to avoid actual copying. The operating
system 56 then performs protocol processing if necessary in
step 102. That is, the data to be sent is placed into the
proper format as may be required by the network 82. The
message, or perhaps several messages if the amount of data is
large, is then injected in step 104 into the network 82
through network interface 84.

Usually, message arrival at the receiver causes an
interrupt to the processor 70, and the operating system 72
directly extracts the message in step 106 from the network
(meaning network and network interface) and copies it into
message buffer 74 in the operating system. Alternatively, in
a system with appropriate hardware, the operating system may
set up a direct memory access (DMA) with the network
interface to extract and copy the message to message buffer
74.

In some communication schemes an intelligent network
interface, or perhaps a second processor at the receiver 52
(e.g., the communication co-processor in the Intel Paragon)
extracts the message from the network 82 and copies the
message to the message buffer 74. The operating system 72
then performs protocol processing, if necessary, in step 108,
then copies the message in step 110 from the message buffer
74 at the receiver 52 to application memory such as endpoint
79. Usually the operating system 72 copies this data to
application memory in response to an explicit receive request
by an application, such as a "receive" command 77. The
command includes an ID which is used in part to specify the
message buffer, e.g., 74, from which data is to be received,
and a size indicating the amount of data. It is also
possible for the operating system 72 to automatically copy
data according to some previously saved state
information. As in the sending side, this copying is often
optimized by mapping locations in the application memory to
locations in message buffer 74 to avoid actual copying.
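
The conventional path of Fig. 2 can be traced with a small Python simulation. The step numbers 100 through 110 come from the description above; everything else is an assumption made for illustration, including the function name `conventional_send_receive` and the three-byte placeholder header `b"HDR"` standing in for whatever protocol processing a real network would require.

```python
HEADER = b"HDR"  # placeholder for protocol framing (assumed)


def conventional_send_receive(endpoint_data, size):
    """Trace one message through the conventional copy-based path."""
    steps = []

    # Sender side.
    sender_os_buffer = bytes(endpoint_data[:size])
    steps.append((100, "copy application memory to sender OS buffer"))
    packet = HEADER + sender_os_buffer
    steps.append((102, "sender protocol processing"))
    wire = packet
    steps.append((104, "inject into network"))

    # Receiver side.
    receiver_os_buffer = bytes(wire)
    steps.append((106, "interrupt host, extract into receiver OS buffer"))
    payload = receiver_os_buffer[len(HEADER):]
    steps.append((108, "receiver protocol processing"))
    receiver_endpoint = bytes(payload)
    steps.append((110, "copy OS buffer to application memory"))

    return receiver_endpoint, steps
```

Even in this toy form the data is copied at steps 100, 106 and 110 and the host is interrupted at step 106, which is the overhead that the mapping optimizations mentioned above try to reduce.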

A communication system has overhead which includes both
communication latency and impact on the processors 54 and 70
of the sender 50 and receiver 52. For sake of simplicity,
communication latency and communication overhead are referred
to herein merely as latency and overhead. Latency may be
defined as an amount of time taken for a message to be
transferred from application memory at the sender 50, such as
endpoint 66, to application memory at the receiver 52, such
as endpoint 79. Impact on the processor involves interrupt
handling, data flow control and protocol processing.
Overhead is reduced by optimizing the steps described above
in connection with Fig. 2 so as to reduce latency and impact
on the host processor.

Low overhead is important for applications which require
rapid response behavior, such as parallel and distributed
computing systems and real-time control systems. In parallel
computing systems, low latency is essential to reduce the
amount of time a process at a sender 50 waits for data to be
read from a remote memory location, e.g., application memory,
and for remote synchronization operations (e.g.,
obtaining and releasing locks) to be completed. In
distributed computing systems, performance of a client-server
model is often limited by the amount of time required to do a
remote procedure call (RPC), which is affected by latency.
The importance of low latency is perhaps most obvious in
real-time systems where an inordinate delay in communicating
a control input may lead to disaster.

Even when low latency is not essential for a given
application, it may increase the spectrum of possible
applications and the flexibility in structuring a system. In
parallel computing systems, lower latency enables the
efficient exploitation of more finely grained computations
and thus increased parallelism. In a distributed computing
system, sufficiently low latency may make paging, e.g. by
sender 50, over a network 82 to memory in a remote node, e.g.
receiver 52, faster than paging to a local disk (not shown).
Finally, low latency could help make client-server based
computing systems attractive for realizing flexible real-time
computing systems.

Low impact on the host processor is important to minimize
the degradation on applications due to reduced and
unpredictable processor availability. Predictability is
particularly important for real-time tasks performed by the
host processor. It is important to insulate such
applications from unrelated asynchronous network events.

Current generation parallel computing systems with
proprietary networks obtain latencies in the range of 1 µsec
to 100 µsec. It is desirable to have latency in a
local-area network be no more than 1000 cycles, which for
future 1 GHz processors, is 1 µsec. In conventional 10 Mbps
Ethernet LANs, latency is typically about 1 msec. First
generation 100 Mbps ATM networks can achieve about 250 µsec
latency using conventional network protocols and interfaces.
Because increasing the speed of a network does not
necessarily reduce latency, to achieve lower latencies
improvements are needed in the operating system, the network
interface and the network protocol.
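
The latency figures above can be put side by side with a short calculation. The function names are hypothetical; the numbers (a 1000-cycle budget, a 1 GHz clock, about 1 msec for 10 Mbps Ethernet and about 250 µsec for first generation ATM) are the ones quoted in the text.

```python
def cycles_to_microseconds(cycles, clock_hz):
    """Convert a cycle budget into microseconds at a given clock rate."""
    return cycles * 1e6 / clock_hz


def reduction_needed(latency_us, target_us):
    """Factor by which a measured latency exceeds the target."""
    return latency_us / target_us


target_us = cycles_to_microseconds(1000, 1e9)  # 1000 cycles at 1 GHz

quoted_latencies_us = {
    "10 Mbps Ethernet LAN": 1000.0,                  # about 1 msec
    "100 Mbps ATM, conventional protocols": 250.0,   # about 250 usec
}

for name, latency in quoted_latencies_us.items():
    print(name, "exceeds the target by a factor of",
          reduction_needed(latency, target_us))
```

Going from 10 Mbps to 100 Mbps raises raw bandwidth tenfold while the quoted latency falls only fourfold, which is consistent with the statement that a faster network does not by itself deliver low latency.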

The conventional approach will now be described in more
detail in order to identify the obstacles to obtaining low
overhead communication. Conventional communication systems
in both distributed systems (e.g., TCP/IP implementations)
and parallel computing machines (e.g., the Intel Paragon) are
oriented towards bulk and stream data. In such systems,
messages, often large in size, include a buffer identifier
[illegible] and data. A combination of interface hardware
and operating system software demultiplexes arriving messages
[illegible] into sequential positions in the identified
message buffer, e.g., 74 in Fig. 1.

In all cases, the network interface generates an
interrupt to the host processor. The operating system then
directly copies data from the network interface to a
message buffer 74 within the operating system or sets up a
transfer which accomplishes this message copy.

Because this message buffer 74 is within the operating
system 72, as described earlier, the operating system 72 at
the receiver must copy the data from the message buffer 74
to application memory, such as an endpoint. One way to
eliminate the copying overhead of such buffering is to map
the application memory of the endpoint to the message buffer.
Alternatively, the data could be copied directly to
application memory from the network interface.

Regardless of the implementation method, messages are
commonly demultiplexed to sequential locations in the
message buffers in the operating system or in the
application [illegible] interface): the position at which a
message is stored within a message buffer is determined
implicitly, e.g., by a [illegible] pointer. Also, typically,
every message causes an interrupt [illegible] operating
system at the receiver [illegible] and address a message, or
[illegible] direct control over either [illegible]
interrupt [illegible] is appropriately [illegible]

message.

It is common with such conventional network protocols to
[illegible] application and [illegible] on a single
processor. [illegible] overhead, because many [illegible]
are required to handle an interrupt. First, a trap
[illegible] interrupt. Finally, an [illegible] used to
demultiplex the interrupt [illegible]. The demultiplexing
overhead both incurs delay in [illegible] receiver. The
unpredictable nature of [illegible] asynchronous message
arrival [illegible] application [illegible]. Frequent
interrupts also [illegible] performance.

[illegible] addressing. Early multicomputers such as the
Intel iPSC used such an approach, [illegible] contents to
application memory. Kernel involvement remains [illegible]
parallel workstation area, such [illegible] International
Business Machines [illegible] conventional communication
protocols [illegible] another full processor as a
communication [illegible], which is used to handle
[illegible]. A communication co-processor is also useful
for handling complicated gather-scatter operations, which
arise when large [illegible] are used. Since this
co-processor duplicates hardware, it is expensive.

Instead of devoting significant resources completely
[illegible] receiver, like the co-processor approach, to
determine [illegible] deposit and how to handle [illegible]
this determination. Such a known, though less conventional,
approach for a communication system is shown in the block
diagram in Fig. 3. In this system, a sender 87 and
receiver 88 are connected by network 82 via network
interfaces 89 and 90. The respective operating systems
need not have message buffers. In this system, the
sender specifies where a message will be deposited in the
receiver. As indicated by the example command in Fig. 3,
the "send" command now also indicates where the message is to
be deposited in the receiver, i.e., by including the "receive
address". The message sent
to the receiver 88 includes not only header and data
information but also the address at the receiver 88 into
which the message will be placed. The sender may also
directly specify whether an interrupt is to be generated upon
message arrival at the receiver.

Fig. 4 is a flow chart describing the general operation
of the system of Fig. 3. First, the sender determines an
address where the message is to be deposited at the
receiver. The sender invokes the send command in step
[illegible] with this address. The message is then injected
into the network in step [illegible] directly from the
application memory, in contrast to the conventional system
shown in Fig. [illegible] in which a message is injected from
an operating system level message buffer. [illegible]
from the network interface, the message is demultiplexed
[illegible] receive address in the message is [illegible]
directly into the receive [illegible] message (step 114).
Again, this is in contrast to the conventional network system
[illegible] message buffer.

[illegible] an operating system [illegible] memory [illegible]
copy operation ([illegible] Fig. 2) is omitted. Rather,
[illegible] copied directly from application memory
[illegible]. Interrupts to the processor may or may not be
indicated by the message, i.e., the interrupt state is
[illegible] the sender. Various realizations [illegible] the
presence, amount, and details of protection, as indicated
[illegible].

This protocol interface has been used primarily for
parallel machines, such as the T3D, available from Cray
[illegible] of Minnesota, and the [illegible] University
[illegible] machine. Such a machine [illegible] global
address space uses a message [illegible] size (e.g., word or
cache block), which carries both address and data. The data
is stored directly in the receiver address contained in the
message. [illegible] method, demultiplexing at the receiver
is trivial, [illegible] specifies all the information.
Consequently, this message handling is called sender-based
addressing. The analogous term "sender-based interrupts"
[illegible] describe the interrupt [illegible]. [illegible]
assumed to have a single user, or multiple users interacting
benevolently. Protection issues are often not addressed in
their internal network systems [illegible] exclusively for
interconnecting parallel processors. That is, protection 91,

[illegible] are often omitted.

This invention overcomes problems in the prior art by
directly depositing messages where they are required. Fig. 5
is a block diagram of a communication system in accordance
with the invention. It includes a sender computer 270 and
receiver computer 272 interconnected by a network 82 via
network interfaces 274 and 276. The sender and receiver
computers 270, 272 each have respective processors 54 and 70
programmed according to a desired operating system 278 and
280. For the purposes of illustration, the terms endpoint
and connection will be used. These terms should not be
construed to limit the invention, as the invention is
applicable to many types of computer networks. As used
herein, an endpoint signifies a contiguous region of virtual
memory. A connection is a virtual channel authorizing
communication between a pair of endpoints. A connection may
also be multicast, i.e., not restricted to pairs of
endpoints. Applications 58 and 60 being executed on the
sender 270 and applications 78 and 80 being executed on
receiver 272 each have one or more endpoints assigned to
them, e.g., respectively 66, 68, 79 and 81. Each of the
operating systems 278 and 280 stores connection state and
mapping information 282 and 284 respectively, which are
indicative of the states of the sender 270 and receiver 272.
This information preferably is cached in the network
interfaces 274 and 276 as indicated at 286 and 288.

A message 290 sent from the sender 270 to the receiver 
272 includes both control information and data. The control 
information may include an indication of an action to be 
performed, one or more operands indicative of a state of the 
sender and one or more references to information stored at
the receiver. This information stored at the receiver may
also be called state maintained by the receiver or receiver
state.



Fig. 6 is a flow chart describing generally the operation
of the system shown in Fig. 5. The sender first generates
separate control information and data and constructs a
message in step 115. As indicated by the example command 101
in Fig. 5, the "send" command includes an identifier of a
connection, an address from which data is to be taken,
control information and an indication of the amount of data
to be sent.

This message is injected directly from application memory
into the network in step 11[illegible]. Next, the message is
extracted from the network into the network interface in
step 113. In step 116, the network interface demultiplexes
the message, depositing data directly into memory and/or
conditionally delivering interrupts to the processor.

A message originating at a source endpoint bypasses the
operating system and host processor at the sender. At the
receiver, the operand may be compared to connection state
information to determine whether an interrupt should
conditionally be delivered to the processor 70 at the
receiver 272. Also, these operands may be combined with
connection state and mapping information to determine the
address in memory in 79, for example, to deposit the message
data.

Thus, in this invention, copying of message data from 
operating system message buffers to application memory is 
omitted. Furthermore, interrupts are conditionally generated 
according to both sender state and receiver state. Also, an
address into which message data is deposited is determined in
part by sender state and in part by receiver state. Thus, the
sender may be isolated from too much knowledge of the
receiver.

This system is generally based on the observation that
message sending is simple, whereas message reception is 
complex because of the asynchronous nature of message 
receiving. Message handling at a receiver should therefore 
be separated into message delivery and message action. This
separation allows control of asynchronous events, wherein
events are distinguished by their need for the host processor
at the receiver. Events, such as message delivery, which do
not require the host processor are handled directly. Other
events, chiefly synchronization, that require interaction
with the host processor are further divided into actions that
require immediate service and actions that can be delayed and
accumulated and processed at some time convenient for the
host processor, thereby turning them into synchronous
events. With this separation of data and control events,
resources sufficient to bypass the host processor for the
events that do not need it, rather than for all events, are
all that is required.

Message delivery simply involves depositing a message in 
a desired location in the memory of the receiver, e.g., by a 
remote write or a direct memory access (DMA). Message action 
is taking some action in response to reception of a message,
such as returning a value, performing a read operation,
notifying a task that the data has arrived, enabling a task
on the scheduler queue, which is a data structure indicating
tasks eligible to be executed by the processor, or invoking
an arbitrary interrupt handler, e.g., a remote procedure call
(RPC).

Whether an action is immediate or delayed depends on 1) 
when a remote process awaiting the result of the action (if 
any), hereafter called the waiting task (not always the 
sender), needs the response, and 2) the priority of the 
action relative to the priority of other activities at the
receiver. Some examples of immediate actions are a read or
synchronization operation where the waiting task needs the
result to proceed, and a high priority control operation,
such as some operating system action. Some examples of
common delayed actions are notifying a task that data has
arrived and enabling a task on a scheduler queue. The
related message may also be referred to as not requiring
immediate action.

In many cases, a system may be structured so that an 
immediate response is not necessary. For example, a remote 
node may execute another task while it awaits a response from 
a message action. Of course, if a message is destined for a 
waiting task which is not currently active anyway, e.g.,
notifying an inactive process that data has arrived, any
action in response to that message may be delayed. When an
action may be delayed, the message may be queued for 
processing at a later, more convenient time for the 
receiver. Thus, a message for which an action may be delayed 
becomes a synchronous event. Conveniently, queuing a message 
is merely message sending and a queue pointer update. 
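
By way of illustration only, the queuing of a delayable message described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the class `DelayedQueue`, its fields `head` and `tail`, and the fixed message length passed to `drain` are hypothetical names introduced here.

```python
# Hypothetical sketch: a delayed action is queued by depositing the
# message at the queue tail and updating the queue pointer. No
# interrupt is raised; the host drains the queue when convenient.
class DelayedQueue:
    def __init__(self, size: int):
        self.memory = bytearray(size)  # endpoint memory holding the queue
        self.head = 0                  # next offset the host will read
        self.tail = 0                  # next free offset (the queue pointer)

    def deposit(self, msg: bytes) -> None:
        """Performed on message arrival, without host involvement."""
        end = self.tail + len(msg)
        if end > len(self.memory):
            raise OverflowError("queue full")
        self.memory[self.tail:end] = msg   # message delivery
        self.tail = end                    # queue pointer update

    def drain(self, msg_len: int) -> list:
        """Performed by the host later, as a synchronous event."""
        out = []
        while self.head + msg_len <= self.tail:
            out.append(bytes(self.memory[self.head:self.head + msg_len]))
            self.head += msg_len
        return out
```

The sketch keeps delivery and action apart: `deposit` only writes memory and moves a pointer, while `drain` represents the accumulated actions being processed at a time chosen by the host.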

To implement this kind of protocol, herein called direct
deposit messaging, a message contains the connection, which
is an identifier which implicitly identifies the receiver
endpoint, control information indicating both an action to be
performed and one or more operands, and data. Each operand
may be an address to be used by the receiver, and/or may be a
parameter used to determine whether the specified action must
be performed immediately or may be delayed, and/or may name
some receiver state. Addresses are encoded as an offset from
the base of the endpoint. The offset is essentially a
network logical address that insulates the sender and
receiver from the addressing details, e.g., address space
size, virtual to physical mappings, and page size, at the
other. This separation promotes modularity and accommodates
node heterogeneity. Furthermore, an offset typically does not
need the full dynamic range of a virtual or physical address
and thus can be encoded in fewer bits within a message.

A set of primitive actions, representing common
operations that may be implemented simply without host
processor or operating system intervention, is provided.
More complex actions are left to the host processor. The
primitive actions described herein are simple data transfer,
i.e., read and write to endpoint locations, and conditional
interrupts to the host processor for delayable or immediate
actions.

The simplest operations are pure sender-based direct read 
and write data transfers. For a direct write, the sender 
specifies the source data by its offset from the base of the 
source endpoint and the receiver location by the offset from 
the base of the receiver endpoint. Messages contain the 
receiver offset and the data. For reads, the source sends a 
message with a direct write request to the receiver, along 
with the offset in the receiver and the deposit offset (for 
the reply) in the source, and an indication of a reply 
connection if the connection is not duplex. 
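
For illustration, the direct write and read transfers described above, with addresses carried as offsets from the endpoint base, may be sketched as follows. This is a minimal sketch under stated assumptions: the `Endpoint` class and the function names are hypothetical, endpoints are modelled as in-process byte arrays, and the network itself is omitted.

```python
# Hypothetical sketch of sender-based direct write and read, with
# addresses expressed as offsets from the base of an endpoint.
class Endpoint:
    def __init__(self, size: int):
        self.mem = bytearray(size)     # contiguous endpoint memory

    def check(self, offset: int, length: int) -> None:
        # An offset is only meaningful inside the endpoint bounds.
        if offset < 0 or length < 0 or offset + length > len(self.mem):
            raise ValueError("offset outside endpoint")


def direct_write(dst: Endpoint, dst_offset: int, data: bytes) -> None:
    """The message carries the receiver offset and the data."""
    dst.check(dst_offset, len(data))
    dst.mem[dst_offset:dst_offset + len(data)] = data


def direct_read(remote: Endpoint, remote_offset: int, length: int,
                reply_ep: Endpoint, deposit_offset: int) -> None:
    """A read is served by a reply write into the requesting endpoint."""
    remote.check(remote_offset, length)
    data = bytes(remote.mem[remote_offset:remote_offset + length])
    direct_write(reply_ep, deposit_offset, data)
```

Because only offsets cross the network in this sketch, neither side needs to know the other's virtual addresses or page size.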

To enable actions which are a function of both sender and 
receiver state, the receiver end of each connection maintains 
some state, i.e., stores some information which message 
operands may name and so obtain receiver addresses. To 
simplify matters, this state is contained in specially 
addressable locations which herein are called "address 
registers". This state could also be held in general memory 
locations. Thus, message actions are a function of an 
operation specified by the sender, operands representing
sender state, and receiver state, such as the contents of the
address registers.

The following is an example set of primitive operations
using sender and receiver state.

Address generation:

1. Direct addressing: effaddr = operand
2. Indirect addressing: effaddr = <addreg_i>
3. Indexed addressing: effaddr = (<addreg_i> + operand)

Register operations:

1. addreg_i <- operand
2. addreg_i <- unary-op <addreg_i>
3. addreg_i <- <addreg_i> binary-op <addreg_j>

where i and j are not necessarily different.

Conditional interrupts:

1. if (<addreg_i> compare-op operand)
   then generate interrupt at end
2. if (<addreg_i> compare-op <addreg_j>)
   then generate interrupt at end

where i and j are not necessarily different.

Some form of address generation unit calculates an effective
address (effaddr) at which to read or write data. <X>
denotes the contents of memory location X. "Operand" may be
[illegible] or other operand in a message. The message
operation controls whether a read or write occurs to memory,
the primitives selected, and their [illegible]. [illegible]
occur at any time, but the interrupt preferably occurs at
the end of the compound operation.
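
As an illustration, the three addressing modes and the conditional interrupt test may be sketched as follows. This is a minimal sketch under stated assumptions: the `Mode` enumeration, the function names, and the modelling of address registers as a Python list are hypothetical and are not taken from the disclosure.

```python
# Hypothetical sketch of effective address generation and the
# conditional interrupt test, with address registers as a list.
import operator
from enum import Enum


class Mode(Enum):
    DIRECT = 1     # effaddr = operand
    INDIRECT = 2   # effaddr = <addreg_i>
    INDEXED = 3    # effaddr = <addreg_i> + operand


def effaddr(mode: Mode, operand: int, addreg: list, i: int) -> int:
    if mode is Mode.DIRECT:
        return operand
    if mode is Mode.INDIRECT:
        return addreg[i]
    if mode is Mode.INDEXED:
        return addreg[i] + operand
    raise ValueError(mode)


def wants_interrupt(compare_op, addreg: list, i: int, rhs: int) -> bool:
    """True if an interrupt should be generated at the end of the
    compound operation; rhs is an operand or another register value."""
    return bool(compare_op(addreg[i], rhs))


# Example: register 0 holds a base of 100, register 1 holds a count of 8.
regs = [100, 8]
indexed = effaddr(Mode.INDEXED, 5, regs, 0)
fire = wants_interrupt(operator.lt, regs, 1, 10)
```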



These primitive operations allow a rich set of powerful 
and flexible compound operations. For example, an indirect 
write with postincrement can be synthesized with an
indirection followed by a register operation:

<addreg_i> <- MSG
addreg_i <- <addreg_i> + operand

The last step may also be: addreg_i <- <addreg_i> + <addreg_j>

Done on a per-cell basis, this compound operation is
equivalent to DMA with stride equal to the increment value.
However, note that varying operand or <addreg_j> yields
variable strides.
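
The indirect write with postincrement may be illustrated as follows. This is a minimal sketch under stated assumptions: receiver memory is a byte array, the address registers are a Python list, and the function name is hypothetical.

```python
# Hypothetical sketch: indirect write followed by a register update,
# which behaves like DMA with a stride equal to the operand.
def indirect_write_postincrement(mem: bytearray, addreg: list, i: int,
                                 msg: bytes, operand: int) -> None:
    addr = addreg[i]
    if addr < 0 or addr + len(msg) > len(mem):
        raise ValueError("write outside receiver memory")
    mem[addr:addr + len(msg)] = msg     # <addreg_i> <- MSG
    addreg[i] = addr + operand          # addreg_i <- <addreg_i> + operand


# Three two-byte messages deposited with a stride of four bytes.
mem = bytearray(16)
regs = [0]
for cell in (b"AA", b"BB", b"CC"):
    indirect_write_postincrement(mem, regs, 0, cell, 4)
```

Passing a different operand with each message would give the variable strides noted above.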

As another example, priority queueing and interrupts can
be synthesized as follows:

<addreg_p> <- MSG
addreg_p <- (<addreg_p> + <addreg_s>)
if (operand greater than <addreg_i>) generate interrupt at end
addreg_i <- <addreg_i> bitwise-or operand

where "operand" indicates the priority of the message,
addreg_p points to the end of the queue to which a message
of this priority should be added, addreg_s contains the size
of MSG, and addreg_i holds the priority level at the
receiver, where p, i and s are different from each other. The
message specifies the operand and register indices p, i, and
s.

Complex compound operations like this priority queueing
may require multiple compound operations. For example, two
compound operation messages would be required in this case if
the receiver executes one register operation per message.
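
The priority queueing sequence may be illustrated as follows. This is a minimal sketch under stated assumptions: priorities are encoded as one-hot bit values compared as unsigned integers, all four steps are performed in one call rather than split across two messages, and the function and variable names are hypothetical.

```python
# Hypothetical sketch of priority queueing with a conditional interrupt.
# Assumes one-hot priority bits and unsigned comparison.
def priority_enqueue(mem: bytearray, addreg: list, p: int, s: int, i: int,
                     msg: bytes, operand: int) -> bool:
    tail, size = addreg[p], addreg[s]
    if len(msg) != size or tail < 0 or tail + size > len(mem):
        raise ValueError("message does not fit the queue")
    mem[tail:tail + size] = msg          # <addreg_p> <- MSG
    addreg[p] = tail + size              # addreg_p <- <addreg_p> + <addreg_s>
    interrupt = operand > addreg[i]      # decided now, raised at end
    addreg[i] |= operand                 # addreg_i <- <addreg_i> bitwise-or operand
    return interrupt


# Registers: [low-priority queue tail, high-priority queue tail,
#             message size, pending priority bits]
mem = bytearray(32)
regs = [0, 16, 4, 0]
r1 = priority_enqueue(mem, regs, 0, 2, 3, b"lo__", 0b01)
r2 = priority_enqueue(mem, regs, 1, 2, 3, b"hi__", 0b10)
r3 = priority_enqueue(mem, regs, 0, 2, 3, b"lo2_", 0b01)
```

In this sketch an interrupt is requested only when the arriving priority exceeds everything already pending, so the third, low-priority message is queued silently.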

As a final example, sometimes it may be convenient to
append messages to one of several different queues without
generating interrupts and maintain a bit vector of non-empty
queues. This mechanism can be implemented in generally the
same way as priority-based interrupts, but with the most
significant bit of addreg_i set to block interrupts. This
implementation assumes both unsigned comparison and fewer
queues than bits in addreg_i. This second assumption could
be relaxed by using multiple address registers. Messages may
also be appended to a queue within a specialized endpoint
within the operating system, enabling delayed actions
involving the operating system.

As other examples, various atomic operations such as
fetch-and-increment, read-modify-write, and compare-and-swap
can be implemented by devoting one or more of the address
registers for the target location and using the register
operations for incrementing and comparing. Barrier
synchronization can also be implemented this way. When a
process reaches a barrier point, it toggles a bit in a
specified address register of all processes in the "barrier
set" and then waits for a conditional interrupt when all the
bits are set (or cleared).
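
Barrier synchronization of this kind may be illustrated as follows. This is a minimal sketch under stated assumptions: each member of the barrier set is modelled as an in-process `Node` object, the toggle is a bitwise exclusive-or on one register, and the names are hypothetical.

```python
# Hypothetical sketch of a barrier built from a register operation
# (toggle a bit) and a conditional interrupt (all bits set).
class Node:
    def __init__(self, members: int):
        self.barrier_reg = 0                  # one bit per member
        self.full = (1 << members) - 1        # value when all have arrived
        self.interrupted = False

    def on_toggle(self, bit: int) -> None:
        self.barrier_reg ^= bit               # register operation
        if self.barrier_reg == self.full:     # compare, interrupt at end
            self.interrupted = True


def reach_barrier(rank: int, nodes: list) -> None:
    """The arriving process toggles its bit at every member."""
    for node in nodes:
        node.on_toggle(1 << rank)


nodes = [Node(3) for _ in range(3)]
reach_barrier(0, nodes)
reach_barrier(2, nodes)
early = any(n.interrupted for n in nodes)
reach_barrier(1, nodes)
```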

This protocol provides at least the following benefits. 
First, the combination of sender state, receiver state, and
operations on the two is very powerful. The full superset of
capabilities of both conventional, receiver-based addressing,
and sender-based addressing is possible. The mix of sender
and receiver information can be varied on a per message basis
to accommodate different requirements on what the sender
knows, or alternatively, different requirements on the
isolation of knowledge between sender and receiver.
Indirection provides protection by isolating the sender from
too much knowledge of the receiver because the actual storage
address is partially a function of an address the sender may
not know. Similarly, the interrupt status is partially a
function of receiver information contributed (via messages)
by processes that the sender may not know exist. As will be
described below, a mechanism may also be provided for
register protection that prevents a sender from accessing by
read or write or otherwise determining or modifying the
contents of specified receiver state, such as address
registers.

Second, predictability of computation, which is important 
in real-time systems, is increased by restricting control 
flow interrupts to well-defined points. In effect, the
asynchronism and nondeterminism are eliminated from
asynchronous network events.

Third, action handling is more efficient and results in 
less overhead because interrupt overhead is amortized over 
multiple actions. Polling can be used to synthesize hybrid 
interrupt-polling action methods. 

Finally, more complex operations can be formed by 
combining the result of multiple compound message operations. 

This protocol is preferably endpoint and connection 
based. Endpoints and connections are allocated and 
deallocated with kernel calls. Preferably, endpoints are 
page-aligned. Thus, host virtual memory page protection is 
also used within endpoints. A connection may be established 
between any pair of endpoints, including endpoints on the
same node. The connection establishment protocol is much
like session establishment in Berkeley UNIX sockets. Some
out of band mechanism, such as a boot-time agreed upon kernel
endpoint and connection, is used to arrange allocation of the
endpoint and connection in the receiver.

An endpoint may have multiple originating connections
and/or multiple terminating connections. Connections can be
simplex, duplex, or multicast. Connections originating from
or terminating on an endpoint all share the same mapping
information. However, endpoints can be overlapped or nested
to form more complicated protection patterns. For example,
connection A could create an endpoint with virtual address
bounds (v_0, v_1). Then connection B could create a
second endpoint that is a proper subset of this range to
allow connection A access to all of B, but B to only access a
portion of A. Or, connection B could create an endpoint that
partially overlaps with A's (v_0, v_1) range to allow
connections A and B a limited range in which to share without
exposing their entire respective endpoints to the other.
Different protection schemes can also be realized by mapping
the physical pages of an endpoint to virtual address ranges
with different page protection.

Network protection is provided as follows. Access to the 
network via out-going connections is controlled by 
per-connection state maintained by the kernel. Messages
arriving from the network are checked for authorization
against the receiver connection state maintained by the
receiver kernel.
Authorization to receive a message from an incoming 
connection implicitly authorizes the message to write in the 
associated endpoint. However, the receiver address must 
still map to a legitimate endpoint address and the operation 
must be permitted by per-page access rights. 

Protection can also be provided in such a system by 
having the receiver verify that the operation requested and 
the address used at the receiver are legitimate. For
example, a memory region can be specified for each
application. If the address to which data is written is not 
in the specified region, access is denied. 
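
The receiver-side checks described above may be illustrated as follows. This is a minimal sketch under stated assumptions: the table layouts, the 4096-byte page size, and the single-letter access rights ("r", "w") are hypothetical choices made for the example.

```python
# Hypothetical sketch of receiver-side authorization: the connection
# must be known and open, the address must fall inside the endpoint,
# and every touched page must permit the requested operation.
PAGE = 4096  # assumed page size


def authorize(conn_table: dict, endpoints: dict, conn_id: int,
              offset: int, length: int, op: str) -> bool:
    conn = conn_table.get(conn_id)
    if conn is None or not conn["open"]:
        return False                       # no authorized connection
    ep = endpoints[conn["endpoint"]]
    if offset < 0 or length <= 0 or offset + length > ep["size"]:
        return False                       # outside the endpoint region
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    return all(op in ep["page_rights"][pg] for pg in range(first, last + 1))


endpoints = {0: {"size": 8192, "page_rights": ["rw", "r"]}}
conn_table = {7: {"open": True, "endpoint": 0}}
```

With these tables, a write through connection 7 is accepted in the first page and refused in the second, which is read-only.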

An implementation of this system, specialized to ATM
networks, will now be described. It should be understood 
that the following is just an example, and that the invention 
may be implemented for other networks and in different ways. 

Fig. 7 shows one possible format of a 53 byte ATM cell
120 for this implementation. In the simplest format, a cell
includes data and control information along with the standard
network header and other information. The sizes of the data
fields in this format are merely exemplary and are not 
intended to be limiting. 

In Fig. 7, an ATM header field 132 contains link routing, 
which indirectly identifies the receiver, and traffic control 
information. This field is five bytes and is in a format 
suitable for processing by a standard ATM switch. The 
connection number is encoded in a virtual channel/virtual
path identifier (VCI/VPI) field (not shown) in this header.

There are also a field of 32 bytes of data 122 and 16
bytes of control data (discussed below) per cell 120. The
data field size matches memory and cache block sizes of 32 
bytes and thus enables fast, efficient hardware 
implementations. Data masks can be used to eliminate 
unwanted message data at the receiver, as explained shortly. 

A cyclic redundancy check (CRC) field 124 of two bytes 
may also be provided, to correspond to the data field 122, at 
the end of the cell 120, to help prevent an errant message 
from being interpreted as a valid message. Two other bytes 
are unused. 

The control information includes a four-byte operation 
field 130 which specifies the type of operation to be 
performed at the receiver. This operation field 130 may
include a mask field 131 and opcode field 129. The opcode
field specifies the operation, whereas the mask field can be
used to deselect the reading or writing of four byte words
within a block of the data field 122. That is, bit i in the 
mask controls whether data word i is read or written. This 
feature is useful to update a location without changing the 
values (e.g., variables) in neighboring locations in a 
block. A four-byte operand field 126 is also provided. The 
operand is a 32 bit immediate source operand (offset or
data). Destination operands are specified via three separate 
register indices encoded in a four-byte index field 128. 
These separate index fields are shown at 121, 123 and 125. 
The check field 127 contains a simple check sum over the 
prior control fields so that decoding of the control can 
begin without waiting for the entire cell to arrive. 
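
The effect of the mask field on a write may be illustrated as follows. This is a minimal sketch under stated assumptions: a set bit i is taken to mean that word i is written (the polarity is not stated above), bit 0 corresponds to the first word of the block, and the function name is hypothetical.

```python
# Hypothetical sketch of a masked write of one 32-byte data field.
# Assumption: bit i set in the mask means four-byte word i is written.
WORD = 4
BLOCK = 32


def masked_write(mem: bytearray, addr: int, data: bytes, mask: int) -> None:
    if len(data) != BLOCK:
        raise ValueError("data field must be 32 bytes")
    if addr < 0 or addr + BLOCK > len(mem):
        raise ValueError("block outside memory")
    for i in range(BLOCK // WORD):
        if mask & (1 << i):
            lo = i * WORD
            mem[addr + lo:addr + lo + WORD] = data[lo:lo + WORD]


# Update words 0 and 2 only; neighboring words keep their old values.
mem = bytearray(b"\xff" * 32)
masked_write(mem, 0, bytes(range(32)), 0b00000101)
```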

For a read request to a remote node, the data field 122 
contains the control field for the reply write message. 
There is also a multiple cell message format for block 
transfer. In this format, the first cell is a "control" cell 
in the write format shown in Fig. 7, and the following cells 
are standard AAL5 (ATM Adaptation Layer 5, an ATM signaling
standard) cells. To avoid complexity with cell boundaries 
and length and CRC in the last AAL5 cell, all block transfers 
are multiples of 16 bytes. 

As mentioned above, the direct deposit model of 
communication herein described is endpoint and 
connection-based: an application allocates endpoints, sets up 
connections between the endpoints, and then sends messages 
over these connections. To support such operation with the 
direct deposit model of communication herein described, the 
operating system at each node maintains the following data 
structures. However, a hardware implementation may cache 
some or all of these data structures to support high speed
operation, as will be described in more detail below in
connection with Fig. 9.

An endpoint table includes an entry for each endpoint at
that node (e.g., sender 50), indexed by an endpoint number.
Each entry in this table contains indications of a base
memory address for the endpoint, e.g., in memory [illegible]
at the receiver, endpoint size, virtual to physical mapping
information, access information, e.g., private, read only or
shared, all open connections to the endpoint, and any
processes attached to the endpoint.

A connection table includes an entry for each connection
originating or terminating at that node, indexed by the
connection number. Each entry in this table contains an
indication of an endpoint number, address register base and
bounds, connection state information, and reply connection
information.

A node address table is also used. Each entry for a node
includes the name of a remote node and the connection number
for a connection (direct or indirect) to the operating system
of the remote node. The index to the table is a unique
global identifier for each node. This table is used to
contact remote nodes for connection set up. This table may
also contain naming information via some alternative
signalling mechanism, such as Internet protocol (IP)
addresses.
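
The three tables may be illustrated as follows. This is a minimal sketch under stated assumptions: the class and field names, the use of Python dictionaries as tables, and the helper `endpoint_for_connection` are hypothetical; only the kinds of information held follow the description above.

```python
# Hypothetical sketch of the per-node tables kept by the operating system.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EndpointEntry:
    base_address: int                               # base memory address
    size: int                                       # endpoint size in bytes
    page_map: dict = field(default_factory=dict)    # virtual -> physical page
    access: str = "private"                         # private, read only, shared
    connections: list = field(default_factory=list)
    processes: list = field(default_factory=list)


@dataclass
class ConnectionEntry:
    endpoint_number: int
    addreg_base: int                                # address register base
    addreg_bounds: int                              # address register bounds
    state: str = "open"
    reply_connection: Optional[int] = None


@dataclass
class NodeEntry:
    name: str
    os_connection: int                              # connection to remote OS
    ip_address: Optional[str] = None                # alternative naming


endpoint_table = {3: EndpointEntry(base_address=0x10000, size=8192)}
connection_table = {7: ConnectionEntry(endpoint_number=3,
                                       addreg_base=0, addreg_bounds=8)}
node_table = {42: NodeEntry(name="remote-node", os_connection=1)}


def endpoint_for_connection(conn_number: int) -> EndpointEntry:
    """Look up the endpoint that a connection terminates on."""
    return endpoint_table[connection_table[conn_number].endpoint_number]
```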

Data are delivered to endpoints and interrupts are
delivered to the operating system. Specifically, data are
delivered to the endpoint at the receiver of a specified
connection and not to processes which may be attached to
endpoints. Interrupts are delivered to the operating system
running on the receiver and not to specified processes. The
operating system may then deliver interrupts to specified
processes.

Operation setup for such a system will be described in
connection with the flowchart of Fig. 12. The sender 50
allocates an endpoint in virtual address space [illegible]
virtual to physical mapping information.
Then, via an alternate connection, perhaps a dedicated
operating system connection or an alternate network like a
transport control protocol/internet protocol (TCP/IP)
connection, in step 201 the sender contacts the intended
receiver and requests a connection be set up with an
appropriate endpoint buffer size. The receiver then
allocates that size buffer region in its virtual address
space, finds or makes a free slot in the endpoint table and
fills it with the buffer base address and virtual to
physical mapping information. The receiver then acknowledges
the connection in step 20[illegible]. In a multicast
connection, this procedure is repeated for each
sender-receiver pair in the multicast. Messages containing
an offset from the base of the endpoint may then be sent
over the connection in step 204.

This protocol can be implemented without special-purpose
hardware using a computer program on a commercially available
computer and network interface. Such an implementation has
been made using two DECStation 5000/240 workstations,
available from Digital Equipment Corporation. Each
workstation had a Fore Systems TCA-100 ATM network interface
plugged [illegible] TURBOchannel [illegible] without an
ATM switch. [illegible] processor, 64Kbyte direct mapped
off-chip instruction and [illegible] caches, [illegible] of
main memory. [illegible] and I/O subsystems, including
[illegible] 32-bit wide TURBOchannel, operated at 25 MHz.
The Fore TCA-100 was a very simple
interface containing two FIFO queues, one for transmit and
one for receive, and some control and status registers. An
ATM cell was transmitted by the processor writing fourteen
32-bit words representing the 5 bytes of ATM header, 48 bytes
of payload, and 3 bytes of padding over the TURBOchannel
to the TCA-100 interface. An ATM cell was received by the
processor reading fourteen 32-bit words. Although the
TURBOchannel supported DMA, the TCA-100 did not use it. The
TCA-100 either generated a TURBOchannel interrupt when a cell
arrived or a receive cell counter on the TCA-100 was polled
to determine if any cells had arrived. The data rate over
the fiber connecting the TCA-100s was 140 Mbps. The
DECStations ran the Carnegie-Mellon University (CMU)
microkernel-based operating system Mach 3.0 (MK83, UX41), for
which full source code is readily available from CMU of
Pittsburgh, Pennsylvania.
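
The fourteen-word transfer described above may be illustrated as follows. This is a minimal sketch under stated assumptions: the padding is placed after the payload, words are big-endian, and the function names are hypothetical; none of this is taken from the TCA-100 documentation.

```python
# Hypothetical sketch: a 53-byte ATM cell (5-byte header, 48-byte
# payload) padded with 3 bytes to fill fourteen 32-bit words.
import struct


def cell_to_words(header: bytes, payload: bytes) -> list:
    if len(header) != 5 or len(payload) != 48:
        raise ValueError("expected 5-byte header and 48-byte payload")
    raw = header + payload + b"\x00" * 3      # 5 + 48 + 3 = 56 bytes
    return list(struct.unpack(">14I", raw))   # fourteen 32-bit words


def words_to_cell(words: list) -> tuple:
    raw = struct.pack(">14I", *words)
    return raw[:5], raw[5:53]                 # drop the 3 padding bytes


header = bytes([1, 2, 3, 4, 5])
payload = bytes(range(48))
words = cell_to_words(header, payload)
```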

A simple remote write function was implemented on this
system for experimentation purposes. In this implementation,
a 32 byte block of data at a given offset from an endpoint in
the sender was delivered to a sender supplied offset in a
sender named endpoint in the receiver. The data, offset, and
buffer information were packed into a single cell. The data
block size was 32 bytes for agreement with cache block
sizes, restricting the offsets to be 32 byte block aligned
for implementation ease.

The low level Mach kernel exception handling code was
also modified to send a cell via an illegal instruction
trap. On the receiving side, the Mach microkernel was
modified to partly optimize the interrupt path. In
particular, the TCA-100 handler was called directly from the
kernel interrupt trap handler.

Using the implementation described above, the following
experiment was performed. From user level a previously
stored block of data was sent from the source workstation to
the receiver. A user level process at the receiver was
running, testing the endpoint area for the arrival of the
data. Once the data arrived, the receiver process sent it
back to the source workstation where another user level
process was running, testing for the arrival of the data.
The total round trip time was then measured. This includes
two one way remote writes, loop overhead, since the round
trip was repeated twenty times, and measurement overhead.
The results are listed in Table 1 below.

                            First iteration    Average best

Average send overhead       25 µsec            3.4 µsec

Average send to receive     > 1 msec           49.1 µsec

Table 1: Remote write latency

The first row is the average time to send a cell via the
illegal instruction trap send. The second row is the round
trip send-to-receive time corrected for loop and measurement
overheads. The first iteration column lists the time on the
first iteration, and the average best column lists the
average of the times on the 19 following iterations. After
the first two iterations there was typically very little
variation in the times.

The first iteration incurs cache and
translation-lookaside-buffer (TLB) misses which cause the
send-to-receive time of this iteration to be much greater
than that of subsequent iterations. (However, it is not
clear why the send to receive time is so large on the first
iteration. The first iteration send overhead is much more
reasonable.) After all cache and TLB misses and other
transients have dissipated, the round trip time was 49.1
µsec, which means that the best case one-way
send-to-receive time for a remote write was about 24.5
µsec, which is about 80 times faster than a similar test
using Fore Systems' AAL3/4 implementation on the same
hardware running Ultrix 4.3.

The bandwidth was measured by sending a sequence of
consecutive remote writes, with the results shown below in
Table 2.

Block size    Bandwidth

10            19 Mbps

20            22 Mbps

40            24 Mbps

Table 2: Bandwidth achieved with remote write blocks



The block size is the number of consecutive remote writes.
The bandwidth increases as the block size increases since
the interrupt overhead is amortized over more data. The
TCA-100 interrupt handler reads cells until the receive FIFO
empties. Thus if cells are sent sufficiently close together,
only one interrupt, and hence one path through the interrupt
trap handler, may be required to read all the cells. Since
sending a cell was fast with the illegal instruction trap
method, this amortization effect is easily obtained, even
when consecutive blocks are sent using an ordinary "for"
loop. The asymptotic bandwidth achieved is thus a measure of
the per cell overhead in processing a cell from the receive
FIFO.

The 24 Mbps bandwidth is based on 32 data bytes per cell.
Using the same hardware, others have reported a peak
bandwidth of about 48 Mbps using the 44 data bytes per cell
AAL3/4 format. This cell rate translates into a peak
bandwidth of about 35 Mbps using 32 data bytes per cell. Thus
the partly optimized single cell remote write implementation
described above is obtaining about 68% of the practical peak
bandwidth that can be obtained with 32 data bytes per cell.
Even so, the 24 Mbps bandwidth is significantly better than
the 14 Mbps peak bandwidth measured for Fore Systems' AAL3/4
implementation with the same hardware running Ultrix 4.3.
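
The scaling from 44 to 32 data bytes per cell can be checked with a few lines of arithmetic. This is only a sketch of the calculation implied by the text; the function name is hypothetical, and it assumes the cell rate stays fixed while the useful payload per cell changes.

```python
def scaled_peak_mbps(peak_mbps, old_bytes_per_cell, new_bytes_per_cell):
    """Scale a peak data bandwidth to a different payload size per cell,
    assuming the same number of cells per second."""
    return peak_mbps * new_bytes_per_cell / old_bytes_per_cell


# 48 Mbps reported with 44 data bytes per cell, rescaled to 32 bytes.
peak_32 = scaled_peak_mbps(48.0, 44, 32)

# Fraction of that practical peak reached by the 24 Mbps measurement.
fraction = 24.0 / peak_32

print(f"peak at 32 bytes/cell: {peak_32:.1f} Mbps")
print(f"fraction achieved: {fraction:.1%}")
```

The result is a little under 35 Mbps and a fraction between 68% and 69%, in line with the rounded figures quoted above.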

The 24 µsec latency reported above breaks into the
components shown in Table 3.

TURBOchannel time       7 µsec

ATM cell time           3 µsec

CPU and memory time     14 µsec

Table 3: Latency breakdown



Since the TCA-100 ATM interface does not have DMA, all
accesses to the interface to send and receive cells use
programmed I/O. Programmed I/O is slow on the TURBOchannel,
resulting in nearly one third of the latency. The ATM cell
time refers to the time for 53 bytes (one cell) to completely
collect at the receiver at the 140 Mbps data transfer rate
used by the fiber connection. The remainder of the 24 µsec
(14 µsec) is, of course, the CPU and memory time at the
sender and receiver, and is about 58% of the total latency.
Consequently, optimizing the interrupt handler code by 20%
will only reduce the total latency by about three µsec.
Thus, although there may be room for optimization in the
current implementation, only diminishing returns would be
obtained. It is therefore probably fair to conclude that
approximately 20 µsec is the minimum latency for remote
write with this common commercially-available hardware.

In view of these experimental results, special-purpose
hardware should be added for protection and for direct
depositing of data in memory to reduce the load on the
processor, and to reduce memory and I/O bus time. This
hardware support should reduce the end to end latency to
close to single cell time: 2.7 µsec at a data rate of 155
Mbps, and 0.68 µsec at a data rate of 622 Mbps. The reduced
load on the processor could also be important for a server,
such as a file server, in a distributed system that has a
heavy communication load. Variation in latency is also
reduced, and even the worst case latency can be good.
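
The single cell times quoted in this passage follow directly from the 53-byte cell size. The sketch below reproduces them; the function name is hypothetical, and the calculation ignores framing overhead on the physical link.

```python
CELL_BYTES = 53  # one ATM cell: 5-byte header plus 48-byte payload


def cell_time_usec(rate_mbps):
    """Time for one 53-byte cell to arrive at the given line rate."""
    return CELL_BYTES * 8 / rate_mbps  # bits / (Mbit/s) gives microseconds


for rate in (140, 155, 622):
    print(f"{rate} Mbps: {cell_time_usec(rate):.2f} usec")
```

At 140 Mbps this gives roughly the 3 µsec listed for the ATM cell time, and at 155 Mbps and 622 Mbps roughly the 2.7 µsec and 0.68 µsec targets for hardware support.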

Such an implementation in hardware of an architecture of
a network interface for supporting communication in
accordance with this invention will now be described.
Because a node typically may act both as a receiver and
sender, a network interface for a node should handle both 
sets of functions. The functionality of the receiver 52 will 
now be described and includes address mapping and protection, 
address registers, control for data and address paths, and 
flow control. The sender will be described below. 

The implementation has two parts. The first part is a
front end architecture for address mapping, address
registers, and control mechanism functionality. This part
has multiple possible embodiments, each having generally the
same functionality. A general block diagram is provided in
Fig. 8, of which multiple embodiments are described below.
The second part, the back end which connects the network
interface to the host memory, also has multiple possible
embodiments. Three of these are also described below: a
direct connection to the main memory, a traditional I/O bus
connection, or a direct connection to the secondary cache.

The front end architecture will now be described in
connection with Fig. 8. The front end of the network
interface is shown generally at 210. The front end connects
to the host memory 212 via a bus 214. The connection of 214
to host memory 212 is called the back end, which will be
discussed in more detail below. The front end 210, on the
receive side, includes a receive buffer memory with flow
control 216. This receive buffer is preferably a first-in,
first-out (FIFO) memory element. A header splitting and
checking unit 218 processes incoming cells and demultiplexes
the information to VCI/VPI mapping unit 220, control decoder
222 and data splitting and checking unit 224. The data
splitting and checking unit passes blocks of data to block
transfer unit 226, which can transfer data to the host memory
across the back end 214. The VCI/VPI mapping unit 220
determines a connection number and applies it to operation
logic 230. The operation logic will be described in more
detail below in connection with Fig. 9.

The control decoder 222 decodes the control portion of the
incoming cell and determines an index and operand which are
provided to operation logic 230. The index and operand are
the same values indicated in the ATM cell of Fig. 7. The
control decoder 222 also outputs the opcode found in the ATM
cell to a receive controller 228 which provides control
information to the operation logic 230. The operation logic
outputs an address, state information, a condition code, a
reply connection number and interrupts. The address and
interrupts go to the back end 214 whereas a reply connection
number goes to the VCI/VPI mapping unit 232 on the send side
of the front end 210. Also on the send side of the front end
210 is send controller 234 which receives information from
the host processor over bus 214 to provide control
information to the operation logic 230 and a connection
number to the VCI/VPI mapping unit 232. The send controller
also includes send registers 236 from which opcode, operand
and index information are provided to a control encoder 238
which forms this information into the control portion of the
ATM cell described above in connection with Fig. 7.

A block transfer unit 240 on the send side processes data
from the host memory into a data buffer. A cell forming
unit takes the connection information, control
information, and data and forms a cell with appropriate
header information which is then applied to a flow control
and output buffer unit. The output buffer is preferably
a FIFO memory element. The message data may alternatively
originate from data registers contained within the send
registers 236.

Aside from the receive and send controllers 228 and 234
and the operation logic 230, the remaining functional blocks
are standard for network interfaces or are relatively simple
in function. Thus, detailed description of these is
omitted. The operation logic 230 and receive and send
controllers 228 and 234, with a number of embodiments, will
now be described.

The operation logic 230 of Fig. 8 will now be described
in more detail in connection with Fig. 9. For ease of
description and illustration, latches and control signals
between elements in the figure have been omitted. The
signals on the left-hand side of Fig. 9 come from the data
fields in the ATM cell as described above in connection with
Fig. 7. For example, a connection number (conn#) is derived
from the VCI/VPI in the ATM cell header.

The operation logic 230 includes a connection table 140
which caches all or part of the connection table, described
previously, which is stored by the operating system.
Accordingly, for each connection, this connection table 140
contains an entry 142, which includes an endpoint number 146,
address register information, including a base 148 and a
bounds limit 149, and connection state information 150 and
reply connection 151. The use of these fields will be
described in more detail below.

An endpoint table 160 is also provided which caches all
or part of the endpoint table, described previously, which is
maintained by the operating system. Accordingly, for each
endpoint, this endpoint table 160 contains an entry 162,
indexed by endpoint number, which includes an indication of
the base 164 of the endpoint, its bounds 166 and address
mapping information 168. The address mapping information
refers to a page map structure for that endpoint which is
stored in host memory. This endpoint information is in a
separate table so that multiple connections to the same
endpoint can share the same information.

Address registers 170 are also provided. There are many
options for implementing the address registers. Address
registers are preferably private to each connection to limit
problems with managing a possibly limited resource between
competing activities. Any address within an endpoint could
serve as a register for address indirection. Unfortunately,
such generality causes access speed and protection problems.
The speed problem is that a message with indirect addressing
requires a first host memory access in the critical path to
determine the location to store a message and a second host
memory access with post-increment mode to store the updated
index. The protection problem is that the indirection
location can contain any address. The protection problem can
be solved by translating this address, but then two address
translations are required in the critical path: one to
translate the address of the indirection location and another
to translate the address in the indirection location. For
these reasons indirection is preferably restricted to a
number of dedicated address registers per connection located
in interface memory. This arrangement introduces the
awkwardness of a separate name space, but avoids host memory
accesses in the critical path. Each connection has a number
of contiguous locations to form a register window. For
convenience and flexibility, each connection is allowed to
dynamically allocate the window size at connection set up
time. The base and bounds fields 148 and 149 in the
connection table entry 142 point to the beginning and end of
this window respectively. This scheme allows the overlapping
and nesting of register windows to effect different sharing
and protection.
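
A register window of this kind can be sketched as a base-plus-index calculation followed by a bounds check. The sketch below is illustrative only: the names are hypothetical, and it assumes that the bounds field holds the last valid register location (an inclusive limit), which the text does not state.

```python
class RegisterTrap(Exception):
    """Raised where an error trap would occur."""


def register_address(base, bounds, index):
    """Map a per-connection register index to an interface memory location.

    base and bounds play the role of connection table fields 148 and 149.
    Assumption: bounds is the last valid location in the window.
    """
    addr = base + index
    if not base <= addr <= bounds:
        raise RegisterTrap(f"register index {index} outside window")
    return addr


# Two hypothetical connections whose windows overlap on locations 4..7,
# so index 5 of connection A and index 1 of connection B share a register.
WINDOW_A = (0, 7)
WINDOW_B = (4, 11)
shared_a = register_address(*WINDOW_A, 5)
shared_b = register_address(*WINDOW_B, 1)
```

Overlapping windows give two connections a shared register, while a nested window gives one connection access to a subset of the registers visible to another.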

One problem with using address registers is that a
receiver may wish to restrict the access a sender has to
certain address registers. For example, the receiver might
not wish to allow the sender to have the direct ability to
increment a pointer to a queue. For these purposes, the
address registers 170 also include protection bits, which
indicate what type of access a sender may have to the data in
these registers. Three types of access are provided for:
read, write, and indirect. Indirect access allows a sender
to use this information in an operation without allowing the
sender to determine the actual value. For example, an
indirect address operation with postincrement can read the
register and increment it by a specified amount, even if the
sender does not have permission to directly read or write
the register. An exception occurs if the sender specifies an
operation involving an address register access in a way not
permitted by the access protection. The host processor can
thereupon choose to access the register directly.

Although address register protection allows a receiver to
restrict sender access, the sender still names all of the
operands in an operation. This lack of isolation can lead to
other types of protection problems. For example, a sender
could still give inconsistent operands, e.g.,
priority and priority queues that do not match, or specify
the wrong registers. To solve this problem, operand names
are isolated from the sender by being accessible only to the
receiver.

To provide such isolation, the receiver could decode the
operation to find the receiver operands or simply interrupt
the host processor (which incurs significant overhead). To
allow sufficient flexibility in operand specification, this
first alternative requires programmable control at the
receiver, which increases complexity. Nevertheless, a fairly
simple implementation of this alternative is presented
later. To support the very common case of indirection with
postincrement, the following special case is added to the
previous primitives:

effaddr = <addreg_i>;

addreg_i <- <addreg_i> + <addreg_{i+1}>.

To use this special indirect addressing mode, the sender
only specifies an index i in the operand. The receiver uses
address register i for indirection and automatically uses the
next register for the postincrement amount.
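
The special indirect mode with postincrement, combined with the read, write and indirect protection bits, might be modelled as follows. This is a sketch under stated assumptions: the bit values and names are hypothetical, and the text does not say whether the protection bits of register i+1 are also checked, so only register i is checked here.

```python
READ, WRITE, INDIRECT = 0b001, 0b010, 0b100  # hypothetical bit encoding


class AccessException(Exception):
    """Raised when a sender uses a register in a way not permitted."""


def indirect_postincrement(regs, prot, i):
    """Return the effective address and postincrement register i.

    regs[i] holds the pointer; regs[i + 1] holds the increment amount.
    The sender only names i and never sees or writes the pointer itself.
    """
    if not prot[i] & INDIRECT:
        raise AccessException(f"indirect access to register {i} denied")
    effaddr = regs[i]
    regs[i] = regs[i] + regs[i + 1]
    return effaddr


# Hypothetical queue pointer advancing by 32-byte blocks. The sender has
# indirect permission only, so it cannot read or overwrite the pointer.
regs = [0x1000, 32]
prot = [INDIRECT, 0]
first = indirect_postincrement(regs, prot, 0)
second = indirect_postincrement(regs, prot, 0)
```

Because the increment amount lives in a receiver-owned register rather than in the message, a sender cannot advance the queue pointer by an inconsistent amount.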

The connection table 140, endpoint table 160, and address
registers 170 may be allocated from memory in the network
interface which typically is either a static random access
memory (SRAM) or a dynamic random access memory (DRAM).

A conventional translation look-aside buffer (TLB) 180 is
used to map an address indicated by the message into a
physical address (PA). The TLB is preferably a fully
associative cache. The TLB matches on bits identifying the
endpoint as well as on the virtual address (VA) since
multiple endpoints may have the same VA. The TLB also stores
an indication of the access rights to the physical address.

The connection table 140, endpoint table 160, address
registers 170 and TLB 180 are interconnected in the following
manner. The endpoint number 146 output from the connection
table entry 142 is used as the input to the endpoint table
160. The base 148 and bounds information 149 from the
connection table 140 are fed respectively to an adder 156 and
a comparator 158. The adder also receives the index from the
received ATM cell (128 in Fig. 7) and its output is also fed
to the comparator 158. The comparator acts as a filter
through which a valid address to the address registers 170 is
provided; otherwise an error trap occurs.

The endpoint table 160 has its outputs for the base 164
and bounds 166 connected respectively to an adder 172 and
comparator 174. The adder 172 also receives an offset from
the received ATM cell (in operand field 126 in Fig. 7). A
multiplexer 176 receives the output of the adder 172, the
base 164 from the endpoint table 160 and the offset from the
received ATM cell. The output of the multiplexer 176 is
applied to an input of an arithmetic logic unit (ALU) 178.
The ALU also receives as another input a value read from the
address registers 170. The outputs of the ALU 178 are a
condition code which connects to the receive controller (228
in Fig. 8) and a result which is connected to a demultiplexer
179 and the address registers 170. The demultiplexer 179
also receives as another input the output of adder 172. The
output of the demultiplexer 179 is applied to another input
of the comparator 174. The output of the comparator 174 is
either an error or an address within the endpoint range which
is then input to the TLB 180.

The receive controller 228 is used to control the
multiplexer 176, ALU 178, demultiplexer 179, and read and
write of the address registers 170 in accordance with the
state information 150 from the connection table 140, the
opcode/control information from the received ATM cell (130
and 122, respectively, in Fig. 7), and the condition code
from the ALU 178. There is great flexibility within the
receive controller 228 with respect to features supported and
the implementation of these features. If only the basic five
addressing modes described above are implemented, and because
these modes are very simple, the system implementing them can
be hardwired via a finite state machine. Fig. 10 shows a
sketch of such a table-driven implementation. The send
controller 234 could be realized in a manner analogous to the
receive controller 228.

In Fig. 10, the condition code, state and opcode are
respectively inputs 184, 186 and 188 which are used to index
a table 190. The outputs of the table are control signals
192 to the multiplexer, demultiplexer, ALU and latches (not
shown), new state information 194 and a mask 196. The use of
these outputs will be described in more detail below. In
Fig. 9, these outputs are all subsumed by the line labelled
control from the receive controller to the operation logic.
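
A table-driven controller of the kind sketched in Fig. 10 can be modelled as a lookup keyed by condition code, state and opcode. The opcodes, state values, control strings and mask values below are entirely hypothetical; only the three inputs and three outputs come from the description.

```python
from typing import NamedTuple


class Action(NamedTuple):
    control: str    # stands in for control signals 192
    new_state: int  # new state information 194
    mask: int       # mask 196


# Hypothetical encodings chosen for illustration only.
OP_DIRECT, OP_INDIRECT_POSTINC = 0, 1
STATE_IDLE = 0
CC_NONE = 0

# Table 190: (condition code, state, opcode) -> outputs.
TABLE = {
    (CC_NONE, STATE_IDLE, OP_DIRECT):
        Action("mux=adder; alu=pass", STATE_IDLE, 0b1111),
    (CC_NONE, STATE_IDLE, OP_INDIRECT_POSTINC):
        Action("mux=offset; alu=add; write_reg", STATE_IDLE, 0b1111),
}


def lookup(condition_code, state, opcode):
    """Index the table; an unknown combination is treated as a trap."""
    try:
        return TABLE[(condition_code, state, opcode)]
    except KeyError:
        raise LookupError("no table entry: trap to host processor") from None
```

Adding an addressing mode or a conditional interrupt then amounts to adding rows to the table rather than changing the data path.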

Although not shown in Fig. 9, there are data paths to the
connection table 140, endpoint table 160, address registers
170 and TLB 180 so that the host processor can read and
write their contents. This functionality is provided so the
operating system can maintain the tables and TLB and so
applications can access and manipulate the address
registers. Direct application access to the address
registers poses a protection problem though. The easiest
option is to deny direct access and force applications to
access the address registers only via operating system
calls. However, this solution makes address register access
expensive. Alternatively, the address registers can be mapped
into the application virtual address (VA) space, with the
address registers of each connection mapped to a
different physical address range. Then some
circuitry could extract the connection number from the
physical address and use the address register base 148 and
bounds 149 in the connection table 140 to access the
appropriate region of the address registers 170. With this
solution an operating system call is required only to
establish the mapping. Thereafter, an application can access
address registers using the same address register base and
bounds circuitry used by incoming messages.

Although the endpoint base has been described as a VA, it
could also be a physical address (PA). If the base is a
PA, the mapping can be simplified to merely a bounds
check, however incurring two constraints. The first
constraint is that the endpoint must be allocated on wired
down consecutive physical pages. The second constraint is
that the address registers can only contain relative
addresses or PAs, otherwise VA to PA mapping is still
required. The first constraint restrains use of host memory,
especially if dynamically sized endpoints are desired. The
second constraint means that either an application cannot
use full address pointers for indirection or the operating
system must translate a pointer to a PA before storing the
pointer in an address register.

The functioning of operation logic 230 will now be
described. A connection number (conn#) is derived from the
message, which is used to index into the connection table
140. Protection is provided by ensuring that the connection
number is within the table size, i.e. is a valid connection
number, for example by using a comparator.
The index value from the ATM cell is added to the base
148 from the connection table using adder 156. This sum is
then compared with the bounds 149 by comparator 158. If
the sum is within the bounds, it is provided as an address to
the address registers 170. Otherwise, an error trap occurs.
A value from the address register is thus obtained and may be
applied to the ALU 178.

The endpoint number obtained from the connection table
140 is used to retrieve the base 164, which is added to the
offset value from the received ATM cell using adder 172.
This sum is applied to the multiplexer 176 and demultiplexer
179. The multiplexer is controlled so as to select one of
its inputs (the sum, the base 164, or the offset) to be
applied to the ALU 178. The ALU is controlled to perform
operations on the value received from the address registers
170 and the output of the multiplexer to provide a condition
code and an address. The operations which can be performed
by the ALU 178 include, but are not limited to, operations on
address register contents, such as adding for post-increment
mode, logical operations and comparison, e.g., for
conditional interrupts. The ALU can be used for various
other operations on the address register contents. Unless
conditional interrupts and operations on the address
registers are to be restricted to addressing modes that do
not use the address registers, the address register memory
should be multi-ported or clocked faster than the rest of the
operation logic 230.

The result provided by the ALU 178 may be applied to the
address registers 170. The demultiplexer 179 is controlled
so as to select one of the sum from the adder 172 and the
result from the ALU 178. The result is compared with the
bounds 166 for the endpoint using comparator 174 to add to
the protection described above.

The final line of protection is finding a valid mapping
entry in the TLB 180 for the address output from the
comparator 174. A TLB miss, either because the desired
address mapping is not present in the TLB or because of
inadequate access rights, causes an exception to be delivered
to the operating system. While handling a TLB miss via an
exception to the host processor uses little hardware, it has
the disadvantage of blocking further processing of incoming
cells while the miss is serviced. These incoming cells
preferably are throttled and buffered but may also be
discarded. Alternatively the interface may either use
hardware to service TLB misses or have dedicated mapping per
endpoint to reduce or eliminate misses. The former choice is
hardware intensive, inflexible, and still will delay cell
processing because of delays in accessing the host memory for
mapping information. The latter choice might be workable if
the endpoint table is expanded to contain one or two mapping
entries per endpoint. However, such expansion leads to a
space and performance tradeoff.
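
The successive lines of protection on the receive path (valid connection number, endpoint bounds check, and TLB lookup keyed by endpoint and virtual page) can be summarized in a small software model. It is a sketch, not the hardware: the dictionary layout, the 4096-byte page size and the inclusive bounds are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size


class Trap(Exception):
    """Raised where an error trap or exception would be delivered."""


def deposit_address(conn, offset, conn_table, endpoint_table, tlb):
    """Translate (connection number, offset) to a physical address."""
    # First line of protection: the connection number must be valid.
    if not 0 <= conn < len(conn_table):
        raise Trap("invalid connection number")
    endpoint = conn_table[conn]["endpoint"]
    entry = endpoint_table[endpoint]

    # Adder 172 and comparator 174: base plus offset within the endpoint.
    va = entry["base"] + offset
    if not entry["base"] <= va <= entry["bounds"]:
        raise Trap("address outside endpoint")

    # TLB 180 matches on the endpoint as well as the virtual page.
    mapping = tlb.get((endpoint, va // PAGE_SIZE))
    if mapping is None or not mapping["writable"]:
        raise Trap("TLB miss or inadequate access rights")
    return mapping["frame"] * PAGE_SIZE + va % PAGE_SIZE


# Hypothetical single-connection, single-page configuration.
conn_table = [{"endpoint": 0}]
endpoint_table = [{"base": 0x10000, "bounds": 0x10FFF}]
tlb = {(0, 0x10): {"frame": 0x80, "writable": True}}
pa = deposit_address(0, 0x20, conn_table, endpoint_table, tlb)
```

Keying the TLB on the pair of endpoint and virtual page is what lets two endpoints use the same VA without their mappings colliding.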

The receive controller 228 decodes the opcode field 129 
and uses the condition code from the ALU 178 and the state 
information 150 from the connection table 140 to determine 
what to do with the incoming ATM cell. The control signals 
output by the receive controller 228 are determined according 
to the opcode and condition code and are used to effect a 
desired addressing mode and to store the data. The mask
output 196 (see Fig. 10) selects which elements, e.g. using a
[...] granularity, of the data are actually stored in the
[...] taken directly from 4 bits of the opcode field.

The connection state 150 records connection addressing
information across cells in a multiple cell message. The
first cell in a multiple cell message is a "control" cell
that chooses the addressing mode and specifies the offset and
index. This information is stored in the state field of the
connection table 140 so that subsequent cells for the same
connection can omit the control information and thus carry
more data. For example, the first cell could have 32 bytes
of data and subsequent cells could have 48 bytes of data
(perhaps [...]) using the
control information stored in the connection table state
field. The end of such a data cell sequence could be
indicated either by storing a cell count in the state field,
or by using the standard format wherein the last cell in
such a sequence carries the length and a CRC. Every cell is
checked until a correct CRC is found. Assuming a multi-cell
message terminates with a CRC, this scheme could transfer N
data blocks of 32 bytes each in ⌈2/3 (N-1)⌉ + 2 cells for
N > 1, which asymptotically achieves 3/2 the bandwidth
compared to sending one 32 byte data block per cell.
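
The cell count for a multi-cell message can be worked through numerically. The sketch below reads the count as ⌈2(N-1)/3⌉ + 2: one control cell carrying the first 32 byte block, data cells carrying 48 bytes (one and a half blocks) each, and one final cell carrying the length and CRC. The function name is hypothetical.

```python
def cells_needed(n_blocks):
    """Cells required to send n_blocks blocks of 32 bytes, for n_blocks > 1."""
    if n_blocks <= 1:
        raise ValueError("formula applies for N > 1")
    remaining = n_blocks - 1                 # blocks after the control cell
    data_cells = -(-2 * remaining // 3)      # integer ceil(2 * remaining / 3)
    return 1 + data_cells + 1                # control + data + length/CRC


for n in (2, 4, 100, 1000):
    cells = cells_needed(n)
    print(f"N={n}: {cells} cells, {n / cells:.3f} blocks per cell")
```

As N grows the ratio of blocks to cells approaches 3/2, compared with exactly one block per cell when every cell carries its own control information.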

Many other functions as described above can be
implemented by adding to the opcodes interpreted by the
receive controller 228.

An error trap to the operating system may be handled by
either discarding the cell or by inserting the cell in an
error queue and signaling an exception to the operating
system.






Having now described the receive portion of the front
end, the send portion will now be described. Both control
information and data are provided to send a cell. The
control information comes from a set of send registers 236
which each connection has for sending cells. The first three
registers in this set store the control information, i.e.,
the opcode, operand, and index, for the cell. The data comes
either from the endpoint associated with the connection or
from a special block of send registers, which is useful for
formulating read requests.

The set of send registers 236 for each connection also
contains a status register, a mode register and a "go"
register. The status register contains the number of cells
sent. If initialized to 1, it will be set to 0 when the
interface actually sends the cell. This feature is useful to
serve as an acknowledgment that a sent cell has actually left
the interface since a cell might not be sent right away due
to flow control or an exception. The mode register enables
two variants of sending which will be described below. The
go register is used to actually cause a cell to be sent as
will be described below.

As discussed previously, numerous implementation options
are possible for the front end. Several different
embodiments are presented here. A first embodiment of the
front end architecture in Fig. 9 uses minimal hardware.
Address register operations are restricted to at most one
address register read and write operation per cell. Endpoint
pages are pinned in physical memory while the buffer is
active. Remote reads are handled by the host processor, via
an interrupt. The first restriction means that any
combination of the primitive operations listed above acting
on address registers acts on the same address register.
Also, any binary operations obtain one argument from the
message operand. Thus the postincrement amount for indirect
addressing operations is specified by the message sender.
Alternatively, the desired functionality can be provided by
a sequence of messages, e.g., the postincrement amount could
be specified via a register add in a following message.
Although pinning the endpoint pages prevents large endpoints
and even restricts the number of small endpoints, these
restrictions lead to a simple architecture.

The receive side of this embodiment is as described
above, except that, because the endpoint pages are pinned in
this architecture, the endpoint table 160 contains the
physical address of the base of an endpoint. However, since
physical pages are not necessarily allocated contiguously, a
"cache" of address translation pairs is used to map the
sum of an endpoint base address and offset, labeled VA in
Fig. 9, to the appropriate physical address. For endpoints
of one page or less in size, the PA may be stored directly in
the base field 164 of the endpoint table entry 162. A
mapping bit is added to all endpoint table entries to
control interpretation of the base field. Additional mapping
entries could be added to the endpoint table 160 to
accommodate larger endpoints.

The operation of the send side is as follows. First, the
send endpoints are restricted to be integral page sizes. To
authorize sending to a particular connection without a kernel
call, each outgoing connection from an endpoint has a unique
virtual mapped command area at a fixed offset (in the high
order virtual address bits) from the endpoint virtual pages
and the same size as the endpoint. These connection command
pages map uncached to the network interface. The send
registers 236 for the connection are mapped into the page
just below the base of the connection command region.

To send a block of data at an offset from the base of the
source endpoint to the network via a connection C, a write
operation is performed to the location at the offset from
connection C's command base. The value written is
ignored. After mapping, the low order physical address bits
are the physical address (or an offset, as described later) of
the data block in the endpoint to send and the high order
physical address bits contain the connection number. The
send controller extracts the control information from the
send registers 236 of the connection and reads the data block
to compose a cell. Alternatively, the data can come from a
special block of send registers 236 as discussed above.
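
The address encoding used by the send side can be illustrated with a
short sketch. The field width (OFFSET_BITS) and the function names are
assumptions made only for illustration; the text does not specify how
many bits each field occupies.

```python
# Hypothetical field width: the low-order bits carry the data block
# address (or offset); the high-order bits carry the connection number.
OFFSET_BITS = 20  # assumed, not specified in the text
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode_command_address(connection, block_offset):
    """Build the physical address presented to the send controller."""
    if not 0 <= block_offset <= OFFSET_MASK:
        raise ValueError("block offset does not fit in the low-order field")
    return (connection << OFFSET_BITS) | block_offset


def decode_command_address(phys_addr):
    """Split a command write address into (connection, block offset)."""
    return phys_addr >> OFFSET_BITS, phys_addr & OFFSET_MASK
```

In this sketch the value written by the host is never consulted, which
mirrors the statement that the written value is ignored: the address
alone identifies both the connection and the data block.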
With some host processors there may not be sufficient
bits to encode both the physical address of the data block
and the connection number, as well as the send controller 234
address, into a physical address. Two alternatives are
possible in such systems. The first alternative is to
replace the physical address of the data block with the
offset from the endpoint base. This requires the base
address of an endpoint to be accessible to the send
controller 234 and endpoint physical pages preferably to be
stored contiguously.

The second alternative is to use a right-shifted version
of the physical address of the data block. A very convenient
shift is five bits: with data blocks addressed at 32 byte
granularity, the least significant five bits of the physical
address are unused anyway, assuming a byte-addressed machine.
This shifting frees five bits in the physical address for
encoding the connection number and send controller 234
address. However, this right shifting has repercussions.
Since the page offset is not changed by memory mapping, the
endpoint command virtual address and the data block physical
address are linked by the following constraint: the five most
significant bits in the endpoint command virtual page offset
correspond to the five least significant bits of the data
block physical page number. This constraint has three
consequences. First, the endpoint command region is a factor
of 32 smaller than the endpoint: each entry in the command
region now maps to the base of a data block. Consequently,
each block of contiguous 32 pages in the endpoint has the
same memory protection. Second, endpoints are multiples of 32
pages in size. This constraint can be relaxed if the send
controller 234 can access the endpoint base and size. Third,
endpoints are mapped to contiguous chunks of 32 pages aligned
with a 32 page memory boundary.
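
The bit-level constraint created by the right shift can be checked
with a small sketch. The 4 KB page size (PAGE_BITS = 12) and the
function names are assumptions for illustration only.

```python
BLOCK_SHIFT = 5    # 32-byte blocks: the low five address bits are zero
PAGE_BITS = 12     # assumed 4 KB pages; the text does not give a page size


def shift_block_address(phys_addr):
    """Right-shift a 32-byte-aligned physical address, freeing five bits."""
    if phys_addr % (1 << BLOCK_SHIFT) != 0:
        raise ValueError("data block must be 32-byte aligned")
    return phys_addr >> BLOCK_SHIFT


def top_five_of_command_offset(phys_addr):
    """Five most significant bits of the command page offset."""
    offset = shift_block_address(phys_addr) & ((1 << PAGE_BITS) - 1)
    return offset >> (PAGE_BITS - BLOCK_SHIFT)


def low_five_of_page_number(phys_addr):
    """Five least significant bits of the data block's physical page number."""
    return (phys_addr >> PAGE_BITS) & 0x1F
```

For any aligned block address the two five-bit fields are equal, which
is why the sketch implies that 32 consecutive endpoint pages share one
command page and therefore one memory protection setting.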

The mode register enables two variants of the send
procedure. In the first, the operand is taken from the value
actually written to the connection command region. The
second causes an exception if an attempt is made to send when
the status field is non-zero. The "go" register is not used
in this embodiment.

Any state operated on by the cell, e.g. a write to address
registers, is not updated until after the last possible
exception point, a TLB miss, for that cell. The
cell is retained in the input FIFO and processing of further
cells from that connection is blocked until re-enabled by the
host processor: cells are only removed from the input FIFO
when a cell "commits" after the last exception point.
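
The commit rule can be sketched as follows. The class and method names
and the cell layout (register, value, virtual address) are
hypothetical; the sketch only shows that state is updated, and the
cell removed from the FIFO, after the last exception point has passed.

```python
from collections import deque


class TLBMiss(Exception):
    """Raised by the (hypothetical) address translation step."""


class Connection:
    def __init__(self):
        self.fifo = deque()          # input FIFO of pending cells
        self.address_registers = {}  # state updated only on commit
        self.blocked = False         # set on exception, cleared by the host

    def process_head(self, translate):
        """Process the cell at the head of the FIFO, if allowed."""
        if self.blocked or not self.fifo:
            return False
        register, value, vaddr = self.fifo[0]   # peek, do not remove yet
        try:
            translate(vaddr)                    # last possible exception point
        except TLBMiss:
            self.blocked = True                 # cell stays in the FIFO
            return False
        self.address_registers[register] = value  # commit the state update
        self.fifo.popleft()                       # remove only after commit
        return True

    def reenable(self):
        """Called by the host processor after repairing the exception."""
        self.blocked = False
```

Because nothing is modified before the translation succeeds, the host
can repair the TLB miss and simply re-enable the connection; the
retained cell is then processed again from the start.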

In the second embodiment, the three restrictions of the
first embodiment are removed to obtain three major
enhancements: no pinning of endpoint pages, remote reads
without interrupting the host processor, and a richer set of
register and conditional operations. To support this first
enhancement, the endpoint table 160 contains virtual
addresses and the TLB maps from virtual addresses to physical
addresses. This enhancement also introduces a new category
of exceptions: page faults from references to paged out
endpoint pages. These are treated as another class of
exceptions, and are serviced by the host processor, which
maintains the main virtual mapping tables. The operating
system is now responsible for keeping the mapping information
consistent with the host memory state.

To handle remote reads without interrupting the host
processor, the address mapping for the read data is
computed. To do so, the receiver side, which does mapping,
is reused for sending, i.e., the operation logic 230 is
multiplexed between the receiver side and the sender side.
In this embodiment, in fact, the operation logic 230 is used
for all sends and not just sends requested by remote reads.
Consequently it is no longer necessary to use the virtually
mapped command pages for protection. However, in this
embodiment, the virtually mapped send registers are
retained. To send a block of data at an offset from the base
of a source endpoint, the offset is written into the "go"
register for the appropriate connection attached to that
endpoint. This write causes a data block at the offset, 32
byte block aligned, from the endpoint base to be read and
composed into a cell with the control information in the
opcode, operand, and index registers and sent via the
associated connection. A multiple send mode is also added in
which the status register is set to the number of cells to
send starting at the specified offset.

In this embodiment, three primitive operations can be
executed per message: address generation, register
operations, and conditionals. A main opcode controls the
selection and ordering of the primitive operations. Example
opcodes are read, read multiple, write, write multiple, and
software exception which causes an interrupt to the host
processor. The instruction format allows up to three
different register operands to be named in addition to an
immediate operand. To accommodate all the accesses to the
address register 170, the register operations are all triple
clocked. Due to potential side effects, state recovery after
exceptions is more complicated.

It is sometimes preferable to have greater flexibility in 
control functionality and cell interpretation than may be 
provided by the first two embodiments. For example,
operations for locking or swap-and-compare may be desired.
Also, it might be useful to customize the cell level protocol
for certain applications. Ultimately, flexibility could be
available in the form of full programmability, which is
always a tradeoff between complexity and cost. Two ways to
add flexibility to the previous embodiments are to make the
receive and send controllers 228 and 234 writable, and to add
a programmable finite state machine for cell interpretation.
The first non-header word of an ATM cell may index into a
writable control memory that interprets the remaining fields
of the cell and sets all the control signals. This control
memory might have a number of common cell interpretations
hardwired in and a number of programmable ones, perhaps even
connection-dependent interpretations. However, a
micro-processor should also be considered for this purpose.

Thus, a third embodiment uses a conventional
microprocessor, in effect a communication co-processor, for
the entire front end. The microprocessor simplifies the
hardware since all of the design of the internal control
logic of the microprocessor is leveraged and its own TLB can
be used as TLB 180. The endpoint number may be incorporated
in the high address bits of the VA. Also, fully programmable
cell interpretation and control may be obtained.

Although the third embodiment provides flexibility, even
at 155Mbps it may be difficult to use a microprocessor for
everything. It makes more sense to use a microprocessor to
interpret high level operations. For example, a
microprocessor could be particularly effective for relieving
the host processor of responsibility for read operations.
Such a communications co-processor could still benefit from
hardwired address generation to reduce the computation
burden.

A fourth embodiment uses a programmable cell interpreter
and so is a hybrid of the hardwired control of the first two
embodiments and the microprocessor control of the third
embodiment. Any complex decoding or operations are handled
by either a host processor or a co-processor. In this
embodiment, the various functional units required are
preferably implemented on a programmable gate array, using
RAM for the various tables. This basic functionality may
even function as a front-end for a co-processor so that the
co-processor would have a reduced load. This can be very
helpful to realize both high performance and flexibility at
high data rates like 622Mbps or 1.2Gbps.

Where possible, these embodiments could easily be
pipelined for high performance. However, at 155Mbps
rates, bytes of data arrive at about 50nsec intervals, which
should be long enough to complete the address and control
lookup. It may be necessary to buffer the data for about a
word or so to satisfy memory hierarchy access times. At
622Mbps only a few pipeline stages should be necessary.
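
As a rough check of the timing budget, the interval between bytes can
be computed from the line rate. The nominal rates of 155 Mbps and
622 Mbps used here are assumptions, and the sketch ignores framing
and cell header overhead.

```python
def byte_interval_ns(rate_bps):
    """Time between successive bytes on a link of the given bit rate."""
    return 8 / rate_bps * 1e9


# Nominal rates; actual payload rates are somewhat lower.
interval_155 = byte_interval_ns(155e6)   # a little over 50 ns per byte
interval_622 = byte_interval_ns(622e6)   # roughly a quarter of that
```

At the higher rate each byte leaves only about a quarter of the time
for address and control processing, which is consistent with the
expectation that a few pipeline stages would be needed there.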

As described above, any of these embodiments for the
front end is also connected to a "back end" of the
network interface for connecting the front end to the memory
of its host. Three embodiments for such a back end will now
be described.

In a first embodiment, the front end may connect directly
to the main memory bus. A cache controller snoops this bus
to ensure coherency.

Alternatively, while a direct memory connection is 
attractive performance-wise, it is only an option to computer
builders. An I/O bus interface such as the PCI bus would be
accessible to a far greater market. The disadvantage of most
I/O buses, including the PCI bus, is delay in gaining control
of the bus due to the activity of other bus devices; thus
some degree of on-board buffering is used, which adds to
latency.

As another alternative, a direct cache interface could be
used which does not require processor modifications. In this
embodiment the network connects directly to external
processor cache. This direct coupling of the network to a
cache may reduce the copying of message data. To avoid
unduly diluting this cache with network traffic, and thus
negatively impacting the performance of the processor to
which it connects, another embodiment couples the network to
a separate message cache 252 as shown in Fig. 11.

In Fig. 11, a message cache 252 is connected to the
micro-processor 250 via a bus 256. The message cache 252 is
connected to data cache 254 and main (or host) memory
(not shown) via a bus 258. A mapping unit connects to
the message cache via bus 262 and to the network.

This message cache 252 is fully integrated into the
memory hierarchy (as shown by connections 256 and 258), so
there is no need to copy data from a message buffer to the
memory hierarchy before a process can access the data. The
interface may be implemented at the secondary cache level,
and thus no expensive, special purpose, or custom processor
modifications are required. By restricting the data size of
messages to be equal to the cache block size, e.g., 32 bytes,
cache blocks can be updated atomically, eliminating
complicated and slow circuitry for updating partial cache
blocks.

A problem in this direct cache interface in Fig. 11 is
maintaining coherency between the message cache 252 and cache
254 using a low overhead mechanism. To solve this problem,
the magnitude of the incoherency problem is reduced by
allowing only the network to write into the message cache
252. Then each write into the message cache 252 checks for
and invalidates any blocks in the data cache 254 with a
matching tag. The impact of this checking on the data cache
254 is minimized by performing the checks on a shadow copy of
the data cache tags. Further details on such a direct cache
interface may be found in a copending U.S. patent application
entitled "Low Latency Network Interface", filed November 16,
1993 by Randy B. Osborne.
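
The invalidation scheme can be sketched as follows. The class names,
the dictionary-based storage and the tag values are hypothetical; the
sketch only shows that a network write consults the shadow copy of
the tags and touches the data cache itself only on a match.

```python
class DataCache:
    def __init__(self):
        self.blocks = {}          # tag -> data held in the data cache
        self.shadow_tags = set()  # copy of the tags, checked by the network side

    def fill(self, tag, data):
        self.blocks[tag] = data
        self.shadow_tags.add(tag)

    def invalidate(self, tag):
        self.blocks.pop(tag, None)
        self.shadow_tags.discard(tag)


class MessageCache:
    def __init__(self, data_cache):
        self.blocks = {}
        self.data_cache = data_cache

    def network_write(self, tag, block):
        """Deposit a whole cache block arriving from the network."""
        if tag in self.data_cache.shadow_tags:   # check the shadow copy only
            self.data_cache.invalidate(tag)      # touch the real cache on a hit
        self.blocks[tag] = block
```

Since only the network writes into the message cache, stale copies
can exist in one direction only, so a single check per network write
is enough in this simplified model.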

Numerous modifications and variations to the embodiments 
of the network interface can be made. For example, exception 
handling and flow control can be integrated in the front-end,
for the following reasons. Exceptions caused by error traps,
protection violations, unimplemented operations, TLB misses,
and page faults, e.g., host memory page faults, and possibly
connection and endpoint table page faults, can slow cell
processing in this system and thus lead to a flow control
problem. Cell processing can also be blocked to ensure
atomicity during host processor accesses to this system's
memory structures. For error traps, this problem can be
avoided entirely by immediately discarding the offending
cells, or turned into a buffer overflow problem by putting
such cells on an exception queue for examination at the
convenience of the operating system. Of course, discarding
the offending cells will not work for the other exception
types because even if the offending cell is discarded and an
implicit or explicit "retry" message is returned to the
sender, the exception condition still has to be repaired
before forward progress can be made. In the meantime,
incoming cells are discarded, buffered, or throttled.

However, this only applies to incoming cells belonging to 
the same connection affected by the exception condition;
cells belonging to other connections can be processed once
the exception condition is saved. Cells belonging to the
exception incurred connection cannot be processed, even if
they are not affected by the exception, i.e., they do not
cause a TLB miss or page fault, since some applications may 
depend on the guarantee of sender ordering of ATM cells. 
Since this ordering guarantee is per channel, which maps to 
connection in this system, it is acceptable to continue 
processing cells which belong to different connections but 
share the same endpoint as an exception incurred cell. Of 
course, these cells may cause an exception. 

A strategy for dealing with exceptions will now be
discussed. First, the offending cell is removed from
the front end as quickly as possible. Cells causing error
traps can be discarded or queued. Other cells are retained
in the input FIFO. The connection is then marked as
exceptioned. Flow control is invoked next to throttle
senders transmitting further cells for that connection. In
the meantime, any further cells which arrive for that
connection are buffered, whereas other connections may
continue processing cells. Only the first step is necessary
for error traps. Global "exceptions", like blocking the
system during host processor accesses, require throttling and
buffering across all connections. The throttling and
buffering are compatible with credit-based flow control
schemes.
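
A minimal sketch of this per-connection strategy is given below. The
class, the method names and the string return values are assumptions
made for illustration; real flow control would signal the senders
rather than just record the connection in a set.

```python
from collections import defaultdict, deque


class FrontEnd:
    def __init__(self):
        self.exceptioned = set()            # connections marked as exceptioned
        self.throttled = set()              # connections with flow control invoked
        self.buffered = defaultdict(deque)  # cells held per connection
        self.delivered = []                 # cells processed normally

    def receive(self, conn, cell, faults=False, error_trap=False):
        if error_trap:
            return "discarded"                 # only the first step is needed
        if conn in self.exceptioned:
            self.buffered[conn].append(cell)   # preserve sender ordering
            return "buffered"
        if faults:
            self.exceptioned.add(conn)         # mark the connection
            self.throttled.add(conn)           # throttle its senders
            self.buffered[conn].append(cell)   # offending cell is retained
            return "exceptioned"
        self.delivered.append((conn, cell))
        return "delivered"

    def repair(self, conn):
        """Host repaired the exception: replay retained cells in order."""
        self.exceptioned.discard(conn)
        self.throttled.discard(conn)
        while self.buffered[conn]:
            self.delivered.append((conn, self.buffered[conn].popleft()))
```

Cells for other connections keep flowing while one connection is
exceptioned, and the retained cells are replayed in arrival order so
the per-connection ordering guarantee is kept.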

Hybrid address mapping could be used as an alternative to 
a large global TLB for mapping. That is, each endpoint
buffer table entry could contain one or two mappings and the
TLB could contain the rest. This hybrid mapping is a
generalization of the idea, discussed above in connection
with the first front end embodiment, of inserting the mapping
for single page endpoints directly in the endpoint table.
The mappings in the endpoint table entries could be managed
by the application so as to always contain mappings likely to
be used soon.
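
Hybrid mapping can be sketched as a two-level lookup. The page size,
the dictionary layout of an endpoint table entry and the exception
name are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumed page size; not given in the text


class TranslationMiss(Exception):
    """No mapping in the endpoint entry or in the global TLB."""


def translate(endpoint_entry, tlb, vaddr):
    """Map a virtual address using the endpoint entry first, then the TLB."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    local = endpoint_entry.get("mappings", {})   # one or two vpage -> ppage pairs
    if vpage in local:
        return local[vpage] * PAGE_SIZE + offset
    if vpage in tlb:                             # the global TLB holds the rest
        return tlb[vpage] * PAGE_SIZE + offset
    raise TranslationMiss(vpage)
```

An application that keeps its most frequently used pages in the
endpoint entry would, in this model, rarely need the global TLB at
all.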

Locality in the connection and endpoint tables can also 
be managed. That is, with many connections and large 
endpoints, the respective tables might be large in size. One
idea to reduce the table sizes required is to exploit
locality by essentially "paging" out entries from the
tables. On a page fault when accessing these tables, the
interface can remove the page faulting cell from the cell
processing pipeline and buffer and flow control further cells
for the page faulted connection as described above. The host
processor can then perform the actual "paging" of the tables,
restore state, and restart cell processing for the faulted
connection.

Structures can also be added to provide fair network
access and to prevent network deadlock. That is, processes
are prevented from interfering with each other either by
blocking the network or by congesting the network and
inducing deadlock in other processes. To ensure fairness
some form of admission control could be provided that limits
the duration that one connection can send to the network if
other connections have pending, non-flow controlled traffic.
For performance reasons it is also a good idea to give
priority to operating system traffic. Preventing deadlock
requires several steps. First, each connection has
independent flow control. Independent flow control per
VCI/VPI is fairly standard in ATM networks and interfaces.
Second, any global exceptions that block processing of all
cells have bounded duration. Third, cells requiring a
response from the network are removed even though the reply
connection may be flow controlled. This removal is made
possible by reserving some buffer capacity for reply traffic,
such as by allocating a separate VCI/VPI with associated flow
control buffers, per connection just for reply traffic.
Admission control should favor reply traffic over new traffic.

To ensure at least the operating system can always make 
progress, it should also have its own connection. Any pages 
that the operating system might use, such as in the 
connection and endpoint tables and address registers, should 
be locked down to prevent unbounded delay due to page 
thrashing. 

Global address registers may also be used in the front
end. In the embodiments described above, address registers
are currently private to each connection. This could make it
inconvenient for several different connections to the same
endpoint to share a common queue. One way to solve this
problem is to add a number of address registers that are
global to the endpoint, rather than being strictly private
to connections. The endpoint table could contain base and
bounds to such global registers in the same memory as the
local registers.

Fragmentation buffers can also be provided in the front
end. That is, for multi-cell messages, a mismatch between
the 48 byte cell payload and the memory block size leads to
fragmentation problems. For example, with 32 byte blocks,
there will be 16 byte fragments. Both the sender and the
receiver may keep fragmentation buffers to fragment and then
reassemble 32 byte blocks, or whatever the memory and cache
block size is. However, it is highly likely that it will be
possible to leverage whatever segmentation and reassembly
support there is already in a standard high bandwidth ATM
interface for this fragmentation purpose.
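
The receiver-side reassembly can be sketched as below. The sketch
assumes, for simplicity, that all 48 bytes of each cell payload are
message data and that the block size is 32 bytes; the function name
is hypothetical.

```python
CELL_PAYLOAD = 48   # bytes of payload in an ATM cell
BLOCK = 32          # assumed memory/cache block size


def reassemble_blocks(payloads, block=BLOCK):
    """Regroup a stream of cell payloads into fixed-size blocks.

    Returns (complete_blocks, leftover), where leftover is the partial
    block still waiting in the fragmentation buffer.
    """
    buffer = b"".join(payloads)
    full = len(buffer) - len(buffer) % block
    blocks = [buffer[i:i + block] for i in range(0, full, block)]
    return blocks, buffer[full:]
```

With these sizes one cell yields one full block plus a 16 byte
fragment, and two cells yield exactly three blocks with nothing left
over.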

The sender and receiver addresses also may not be aligned
with respect to a memory or cache block. Assuming that
misalignment is the same at the sender and receiver, an
aligned block can be sent and the portions that should not be
updated can be merely masked out. This may not lead to the
best utilization of cells, but such misalignment is likely to
be rare. Another possible performance penalty is that some
architectures may restrict subblock addressing, so it may be
necessary to perform the masking, or scatter, operation
selectively under a mask. The sender may also perform
masking, but without a gather capability it is optional at
the sender: it affects what goes out on the network but it
does not improve efficiency.

In some cases the sending address may be unaligned both
with respect to block boundaries and with respect to the
receiver address. Shifting, in addition to masking, is
required to deal with this problem. To solve this problem,
alignment shifters can be added.

A prefetch queue could also be used in the front end
because prefetching is useful for hiding latency. The system
does not by itself provide complete prefetching. It requires
a processor that can initiate prefetches and some hardware
that can decode prefetch operations from this processor and
turn them into remote fetches. It may not be possible to put
such hardware on every interface, but it could certainly be
added where the interface connects directly to the data bus.
Provided that a fetch can be initiated in some way, the
return address of the fetch is simply listed as the tail of a
queue. The conditional interrupt scheme could even be used
to indicate when the queue is nonempty.

Also, in some applications, it is desirable to provide
greater isolation between the sender and the receiver. In
the direct deposit model presented above, the sender places
an instruction and operands directly into a message.
Operands may be immediate, or reference some state (in the
address registers) in the receiver. The receiver executes
the instruction. As described earlier, access restrictions
on address registers provide some degree of protection and
isolation. However, there may be insufficient isolation for
some applications due to the fact that the sender must name
all the instruction operands, even if it cannot access those
operands. One way to solve this problem is simply to
interrupt the host processor to perform actions requiring
isolation, as in an operating system.

Another way to solve this problem is to allow the direct
instruction (operation) in a message to be replaced by a
pointer to an instruction in the receiver (an instruction
pointer). The instruction may directly reference receiver
operands, e.g. in the address registers, without knowledge of
the sender. Messages can still provide immediate operands
and name receiver operands, though the receiver may choose
not to use these operands, as may be necessary to maintain
isolation.

One simple embodiment of this solution will now be
presented. The principal modification to the receive side as
described to this point is to add an instruction and operand
buffer area as shown in Figure 13. The connection table 140
now contains base and bounds entries 300 and 302, similar to
that for address registers, for access to an instruction
memory 304. To keep the scheme simple, each instruction is
composed of an operation and operands in exactly the same
format as in an ATM cell as described earlier. The operation
controls from which location (the instruction memory 304 in
the receiver or the message) operands are taken. Protection
bits, similar to those for the address registers 170, allow
the receiver to control which instructions serve as entry
points to the sender.
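
The instruction pointer scheme can be sketched as follows. The
dictionary layout of the connection table entry and of an
instruction, and all field names, are hypothetical; only the base and
bounds check, the entry-point protection bit and the operand
selection come from the description above.

```python
class ProtectionError(Exception):
    pass


def fetch_instruction(conn_entry, instruction_memory, pointer):
    """Resolve a message's instruction pointer at the receiver."""
    base, bounds = conn_entry["base"], conn_entry["bounds"]
    if not 0 <= pointer < bounds:                 # bounds check
        raise ProtectionError("pointer outside the connection's region")
    instr = instruction_memory[base + pointer]
    if not instr["entry_point"]:                  # protection bit
        raise ProtectionError("not an entry point for senders")
    return instr


def select_operand(instr, message_operand):
    """The operation decides where the operand is taken from."""
    if instr["operand_source"] == "receiver":
        return instr["operand"]                   # the sender's value is ignored
    return message_operand
```

In this model the sender names only an index; which receiver state
the instruction touches is decided entirely by what the receiver
stored in its instruction memory.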

There are at least three enhancements to this scheme. The
first is global instructions. Many connections are likely to
share the same operations, though on different operands. To
accommodate this expectation, a capability is added for
global instructions. These are instructions that are
globally accessible across all connections. They are
constrained to operate only on sender-supplied operands to
minimize the difficulty of operating on different receiver
operands.



The second enhancement is separate instruction and
operand memory. Instruction and operand memory could be
separated at the cost of complexity to save memory storage.

The third enhancement is providing multiple instructions 
per cell. A sequencer can be added to step the receiver 
through several instructions per cell. The first instruction 
in such a sequence serves as the entry point. 

It is easy to further add conditional sequencing
operations, subroutines, etc. However, to keep the interface
simple, more complex functionality is best obtained by
trapping to the host processor or by adding a co-processor,
such as a micro-processor.

Having now described a few embodiments of the invention, 
and some modifications and variations thereto it should be 
apparent to those skilled in the art that the foregoing is
merely illustrative and not limiting, having been presented
by way of example only. Numerous modifications and other
embodiments are within the scope of one of ordinary skill in
the art and are contemplated as falling within the scope of
the invention as limited only by the appended claims and
equivalents thereto.

CLAIMS



1. A low overhead communication system for a computer system
having a sender and a receiver interconnected by a network,
wherein the receiver has a network interface connected to a
processor controlled by an operating system, the
communication system comprising:

means, at the sender, for sending to the receiver a
message which includes an operand, an indication of an action
to be performed and a reference to information stored at the
receiver;

means, in the network interface, for receiving the
message;

means, in the network interface and operative in response
to receipt of the message, for determining whether the action
is permitted to be performed at the receiver; and

means, in the network interface and operative when the
action is permitted, for performing, separate from the
processor and operating system in the receiver, an operation
on the operand in the message and the information stored at 
the receiver and for performing the action according to the 
operation. 

2. The communication system of claim 1, wherein the operand
indicates an address in a memory in the receiver and wherein
the means for performing an action includes means for
depositing the message at a location in the memory in the
receiver according to the address in the message and the
information stored at the receiver.

3. The communication system of claim 2, wherein the operand
in the message indicates an address register in the network
interface storing an address in the receiver and the
information stored at the receiver is the address stored in
the address register, the communication system further
comprising:

means for obtaining the address from the indicated
address register; and

means for storing the message in the memory at the
obtained address.

4. The communication system of claim 3, wherein the operand
further indicates an offset and the system further comprises
the means for updating the address in the address register by
the offset.

5. The communication system of claim 1, wherein the means
for performing an action comprises:

means for comparing the operand to the information stored
at the receiver; and

means for generating an interrupt to the processor when
the step of comparing indicates that the message requires
immediate action.

6. The communication system of claim 5, further comprising
means for updating the information stored at the receiver.

7. The communication system of claim 5, further comprising
means for queueing the message when the means for comparing
indicates that the message does not require immediate action.

8. The communication system of claim 5, wherein the operand
in the message indicates an offset and an address register in
the network interface storing an address in the receiver and
the information stored at the receiver is the address stored
in the address register, the communication system further
comprising:

means for obtaining the address from the indicated
address register; and

means for storing the message in the memory at the offset
from the obtained address.



9. A low overhead communication system, comprising:

a sender having a processor which requests that messages
be sent and a network interface connected to the processor
which forms a message, in response to a request from the
processor, containing an operand, an indication of a desired
action and a reference to information stored at the receiver;

a receiver having a processor controlled by an operating
system and connected to a network interface;

a network connected between the network interface
of the sender and the network interface of the receiver and
being adapted to communicate the message between the sender
and the receiver;

wherein the network interface of the receiver receives
the message and performs the action indicated by the
message according to the operand and the information stored
at the receiver only if the action is permitted to be
performed by the receiver.

10. The communication system of claim 9, wherein the operand
indicates an address in a memory in the receiver and wherein
the action is depositing the message at a location in the
memory in the receiver according to the address in the
message and the information stored at the receiver.

11. The communication system of claim 10, wherein the operand
in the message indicates an address register storing an
address in the receiver and the information stored at the
receiver is the address stored in the address register, the
communication system further comprising in the receiver:

means for obtaining the address from the indicated
address register; and

means for storing the message in the memory at the
obtained address.

12. The communication system of claim 11, wherein the operand
further indicates an offset and the communication system
further comprises means for updating the address in the
address register by the offset.

13. The communication system of claim 9, wherein the receiver
includes means for comparing the operand to the state
maintained by the receiver, and means for generating an
interrupt when the means for comparing indicates that the
message requires immediate action.

14. The communication system of claim 13, further comprising
means for updating the information stored at the receiver.

15. The communication system of claim 13, further comprising
means for queueing the message when the means for comparing
indicates that the message does not require immediate action.

16. The communication system of claim 13, wherein the operand
in the message indicates an offset and an address register
storing an address in the receiver and the information stored
at the receiver is the address stored in the address
register, the communication system further comprising, in the
receiver:

means for obtaining the address from the indicated
address register; and

means for storing the message in the memory at the offset
from the obtained address.

17. A method for low overhead communication in a computer 
system having a sender and a receiver interconnected by a 
network, wherein the receiver has a network interface 
connected to a processor controlled by an operating system, 
the method comprising the steps of: 

sending a message from the sender through the network to 
the receiver, wherein the message includes an operand, an 
indication of an action to be performed and a reference to 
information stored at the receiver; 

receiving the message at the receiver; 

insuring that the action to be performed for the sender 
is permitted at the receiver; and 

if the action is permitted, performing, separate from the 
processor and operating system, the action at the receiver 
according to the operand in the message and the information 
stored at the receiver. 
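
The method of claim 17 can be sketched as a permission check followed by an action performed from the message operand and the receiver's stored information. The sketch assumes a permission table of (sender, action, reference) triples and two example actions; `Message`, `NetworkInterface`, and the action names are hypothetical and not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    action: str      # indication of the action to be performed
    reference: int   # refers to information stored at the receiver
    operand: int
    data: bytes = b""


class NetworkInterface:
    """Hypothetical receiver-side interface acting without the host processor."""

    def __init__(self, memory_size):
        self.memory = bytearray(memory_size)
        self.registers = {}      # reference -> address stored at the receiver
        self.permitted = set()   # assumed table of (sender, action, reference)
        self.actions = {"deposit": self._deposit,
                        "set_register": self._set_register}

    def receive(self, msg):
        # Ensure the requested action is permitted for this sender.
        key = (msg.sender, msg.action, msg.reference)
        if key not in self.permitted or msg.action not in self.actions:
            return False
        # Perform the action from the operand and the stored information.
        self.actions[msg.action](msg)
        return True

    def _deposit(self, msg):
        start = self.registers[msg.reference] + msg.operand
        if start < 0 or start + len(msg.data) > len(self.memory):
            raise ValueError("deposit would fall outside receiver memory")
        self.memory[start:start + len(msg.data)] = msg.data

    def _set_register(self, msg):
        self.registers[msg.reference] = msg.operand


nic = NetworkInterface(16)
nic.registers[0] = 4
nic.permitted.add(("alice", "deposit", 0))
ok = nic.receive(Message("alice", "deposit", 0, 2, b"hi"))
denied = nic.receive(Message("bob", "deposit", 0, 2, b"no"))
```

The unauthorized message leaves receiver memory untouched, while the authorized one is deposited with no host-side handler in the path.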

18. The method of claim 17, wherein the operand indicates an 
address in a memory in the receiver and wherein the step of 
performing an action is the step of depositing the message at 
a location in the memory in the receiver according to the 
address in the message and the information stored at the 
receiver. 
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19. The method of claim 18, wherein the operand in the 
message indicates an address register storing an address in 
the receiver and the information stored at the receiver is 
the address stored in the address register, the method 
further comprising the steps of: 

obtaining the address from the indicated address 
register; and 

storing the message in the memory at the obtained address. 

20. The method of claim 19, wherein the operand further 
indicates an offset and the communication system further 
comprises the step of updating the address in the address 
register by the offset. 

21. The method of claim 17, wherein the step of performing an 
action comprises the steps of: 

comparing the operand to the information stored at the 
receiver; and 

generating an interrupt when the step of comparing 
indicates that the message requires immediate action. 

22. The method of claim 21, further comprising the step of 
updating the information stored at the receiver. 

23. The method of claim 21, further comprising the step of 
queueing the message when the step of comparing indicates 
that the message does not require immediate action. 

24. The method of claim 21, wherein the operand in the 
message indicates an offset and an address register storing 
an address in the receiver and the information stored at the 
receiver is the address stored in the address register, the 
communication system further comprising the steps of: 

obtaining the address from the indicated address 
register; and 

storing the message in the memory at the offset from the 
obtained address. 



25. A communication system for low overhead communication in 
a network of multi-user computers, including a sender and a 
receiver, wherein the receiver includes a processor and a 
memory and wherein a portion of the memory is assigned to an 
application program being executed by the processor, the 
communication system comprising: 

means, in the sender, for sending across the network to 
the receiver a message which includes an operand, a reference 
to information stored at the receiver, and data for delivery 
to the application program; 

means, in the receiver and operative when a message is 
received, for directly depositing the data in the message at 
a location in the portion of memory assigned to the 
application program, wherein the location may be independent 
of locations of previous message data and is determined according 
to the state of the sender and the information stored at the 
receiver; and 

means, in the receiver and operative when a message is 
received, for conditionally generating an interrupt to the 
processor according to the state of the sender and the 
information stored at the receiver. 

26. The communication system of claim 25, further comprising: 

means, in the receiver, for preventing access to the 
information stored in the receiver when the sender is not 
authorized by the receiver to access the information. 
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27. The communication system of claim 25, further comprising: 

means, in the receiver and operative when a message is 
received, for changing the information stored in the receiver 
according to the state of the sender. 

28. A network interface for a communication system for low 
overhead communication in a network of multi-user computers, 
including a sender and a receiver, wherein the receiver 
includes a processor and a memory and wherein a portion of 
the memory is assigned to an application program being 
executed by the processor, wherein the sender sends across 
the network to the receiver a message which includes an 
operand indicative of a state of the sender, a reference to 
information stored at the receiver, and data for delivery to 
the application program, the network interface comprising: 

means, operative when a message is received, for directly 
depositing the data in the message at a location in the 
portion of memory assigned to the application program, 
wherein the location may be independent of locations of 
previous message data and is determined according to the 
state of the sender and the information stored at the 
receiver and can be independent of locations used for storing 
previous messages; and 

means, operative when a message is received, for 
conditionally generating an interrupt to the processor 
according to the state of the sender and the information 
stored at the receiver. 

29. The network interface of claim 28, further comprising: 

means for preventing access to the information stored in 
the receiver when the sender is not authorized by the 
receiver to access the information. 

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30. The network interface of claim 28, further comprising: 

means, operative when a message is received, for 
changing the information stored in the receiver according to 
the state of the sender. 

31. A low overhead communication system constructed and 
arranged to operate substantially as hereinbefore described 
with reference to and as illustrated in Figures 5 to 13 of 
the accompanying drawings. 

32. A network interface for a low overhead communication 
system constructed and arranged to operate substantially as 
hereinbefore described with reference to and as illustrated 
in Figures 5 to 13 of the accompanying drawings. 

33. A method for low overhead communication substantially 
as hereinbefore described with reference to and as 
illustrated in Figures 5 to 13 of the accompanying drawings. 
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